All data used in the study are from The Cancer Genome Atlas (TCGA) program, which includes a rich body of imaging, clinical, and molecular data from 11,315 cases of 33 different cancer types (Weinstein et al., Nat Genet 2013). The data are made available by the National Cancer Institute (NCI) Genomic Data Commons (GDC) information system, publicly accessible at the GDC Data Portal.
Section Download below details the procedures used to download the raw data. After download, the raw data were preprocessed in preparation for modeling using the following dedicated Jupyter notebooks:
- Clinical data - Jupyter notebook
- Omics data - Jupyter notebook
- WSIs - Jupyter notebook
The following data modalities were downloaded:
- Clinical data
- Gene expression
- miRNA expression
- DNA methylation
- Copy number variation
- Whole-slide images (WSI)
Clinical data were downloaded using the TCGAbiolinks R package, as detailed below. All other modalities were downloaded from the GDC Data Portal for all cancer entities in the TCGA program using the GDC Data Transfer Tool (docs).
The tool takes an appropriate Manifest file as input, obtained according to the following general procedure:
- Go to the GDC Data Portal;
- Navigate to Repository and check the box for the TCGA program in the Cases tab;
- Filter the data (using the interactive pie charts or check boxes);
- Use the check boxes on the left to select the appropriate experimental strategies;
- Push the Manifest button to download the file.
The generated Manifest files are stored here in the data folder. The code used to download the data from the respective manifest file is reproduced below for each data modality.
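Before starting a multi-terabyte download, it can help to check a manifest against the file count and total size shown in the portal. The following is a minimal sketch, assuming the manifest is a tab-separated file with the header columns `id`, `filename`, `md5`, `size`, and `state`. The helper name `summarize_manifest` and the toy entries are hypothetical and are not part of the original workflow.

```python
import csv
import io


def summarize_manifest(handle):
    """Return (file count, total size in bytes) for a manifest file handle.

    Assumes a tab-separated file with a header row containing a "size" column.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    n_files = 0
    total_bytes = 0
    for row in reader:
        n_files += 1
        total_bytes += int(row["size"])
    return n_files, total_bytes


# Toy manifest with made-up entries, for illustration only.
example = io.StringIO(
    "id\tfilename\tmd5\tsize\tstate\n"
    "id-0001\tsample_a.txt.gz\tmd5-placeholder\t1500\treleased\n"
    "id-0002\tsample_b.txt.gz\tmd5-placeholder\t2500\treleased\n"
)
n_files, total_bytes = summarize_manifest(example)
print(f"{n_files} files, {total_bytes} bytes")
```

For a real manifest, the same function could be called on a file opened with `open(path, newline="")`.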
Clinical data were accessed with the TCGAbiolinks R package using the following R code.
# Download data for all TCGA projects
project_ids <- stringr::str_subset(TCGAbiolinks::getGDCprojects()$project_id, 'TCGA')
data <- list()
for (project_id in project_ids) {
data[[project_id]] <- TCGAbiolinks::GDCquery_clinic(project=project_id, type='clinical')
}
# Merge into single table
# (the "disease" column identifies each original table)
data <- do.call(dplyr::bind_rows, data)
# Write to file
output_path <- '/mnt/dataA/TCGA/raw/clinical_data.tsv'
readr::write_tsv(data, output_path)
The gene expression data are provided either as read counts or as FPKM/FPKM-UQ values. FPKM is designed for within-sample gene comparisons and has fallen out of favor because the normalized gene values it produces do not add up to exactly one million in each sample. In practice, however, the deviation from one million is not dramatic and FPKM often works well enough. Given that normalizing such a large number of samples is challenging, here I will use the FPKM-UQ data.
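The point about FPKM totals can be illustrated with a small sketch. It uses the textbook FPKM definition (read count times 10^9, divided by gene length times total mapped reads) and the standard rescaling of FPKM to TPM. The counts and gene lengths are toy values, and the total is taken over the toy genes only, which is a simplifying assumption rather than the exact denominator used by the GDC pipeline.

```python
import math


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts)
    return [c * 1e9 / (length * total_reads)
            for c, length in zip(counts, lengths_bp)]


def fpkm_to_tpm(fpkm_values):
    """Rescale FPKM values so that they sum to one million (TPM)."""
    total = sum(fpkm_values)
    return [value / total * 1e6 for value in fpkm_values]


# Toy sample: three genes with made-up read counts and lengths.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 500]

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = fpkm_to_tpm(fpkm_values)

print("FPKM sum:", sum(fpkm_values))  # depends on the gene length distribution
print("TPM sum:", sum(tpm_values))    # one million by construction
```

The FPKM total varies from sample to sample with the length distribution of the expressed genes, whereas the TPM total is fixed by the rescaling step.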
- Transcriptome profiling > RNA-seq > HTSeq - Counts (11'093 files; 2.8 Gb)
$ mkdir /mnt/dataA/TCGA/raw/RNA-seq_HTSeq_counts
$ sudo /opt/gdc-client download \
-d /mnt/dataA/TCGA/raw/RNA-seq_HTSeq_counts/ \
-m data/gdc_manifest.2019-08-21.txt
- Transcriptome profiling > RNA-seq > HTSeq - FPKM-UQ (11'093 files; 5.78 Gb)
$ mkdir /mnt/dataA/TCGA/raw/RNA-seq_FPKM-UQ
$ sudo /opt/gdc-client download \
-d /mnt/dataA/TCGA/raw/RNA-seq_FPKM-UQ/ \
-m data/gdc_manifest.2019-08-23.txt
A description of the mRNA expression data analysis pipeline can be found in the docs page.
The miRNA expression data are provided in tables including both read counts and reads per million mapped miRNA reads (RPM). RPM should be appropriate for the current project.
See the docs page for a description of the analysis pipeline.
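As a reminder of what the RPM values represent, here is a minimal sketch that derives RPM from raw read counts: each count is divided by the total number of mapped miRNA reads in the sample and multiplied by one million. The function name, miRNA identifiers, and counts are hypothetical. The downloaded tables already contain RPM, so this only illustrates the calculation.

```python
def reads_per_million(read_counts):
    """Convert a mapping of miRNA id -> read count into RPM values."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("sample has no mapped miRNA reads")
    return {mirna: count / total * 1e6 for mirna, count in read_counts.items()}


# Toy sample with made-up identifiers and counts.
sample_counts = {"mirna_a": 50, "mirna_b": 150, "mirna_c": 800}
sample_rpm = reads_per_million(sample_counts)

for mirna, value in sample_rpm.items():
    print(mirna, value)
```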
- Transcriptome profiling > miRNA Expression Quantification (11'082 files; 557.23 Mb)
$ mkdir /mnt/dataA/TCGA/raw/miRNA-seq
$ sudo /opt/gdc-client download \
-d /mnt/dataA/TCGA/raw/miRNA-seq/ \
-m data/gdc_manifest.2019-08-22.txt
The DNA methylation data are provided in tables of array results giving the level of methylation at known CpG sites. They include unique ids for the array probes and methylation Beta values, which represent the ratio between the methylated array intensity and the total array intensity. Beta values fall between 0 (lower levels of methylation) and 1 (higher levels of methylation).
See the docs page for a description of the analysis pipeline.
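The Beta value definition above can be written out as a short sketch. It assumes that the total intensity is the sum of the methylated and unmethylated probe intensities, with no stabilizing offset, which may differ from the exact formula used by a given array pipeline. The function name and intensities are hypothetical.

```python
import math


def beta_value(methylated, unmethylated):
    """Ratio of methylated intensity to total (methylated + unmethylated)."""
    total = methylated + unmethylated
    if total == 0:
        # No signal for this probe: the ratio is undefined.
        return math.nan
    return methylated / total


# Toy probe intensities (made-up values).
print(beta_value(300, 100))  # mostly methylated probe
print(beta_value(0, 500))    # unmethylated probe
```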
- Transcriptome profiling > Methylation Array (12'359 files; 1.4 Tb)
$ mkdir /mnt/dataA/TCGA/raw/Methylation
$ sudo /opt/gdc-client download \
-d /mnt/dataA/TCGA/raw/Methylation/ \
-m data/gdc_manifest.2019-09-09.txt
The copy number variation (CNV) data are provided as one table for each of the 33 cancer entities, with a column per patient and a row for each of 19'729 protein-coding genes. CNV values are represented as 0, 1, or -1 for each gene, corresponding to "neutral", "gain", or "loss", respectively.
See the docs page for a description of the analysis pipeline.
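To make the table layout concrete, the sketch below maps the three CNV scores to their labels and counts them per gene across patients. The gene names, scores, and in-memory dictionary layout are toy assumptions. The real tables would first need to be read from the downloaded files.

```python
from collections import Counter

# Score-to-label mapping as described in the text.
CNV_LABELS = {-1: "loss", 0: "neutral", 1: "gain"}


def summarize_gene(scores):
    """Count how many patients show loss, neutral, or gain for one gene."""
    return Counter(CNV_LABELS[score] for score in scores)


# Toy table: one row per gene, one score per patient (made-up values).
cnv_table = {
    "GENE_A": [0, 1, 1, -1],
    "GENE_B": [0, 0, 0, 0],
}

summary = {gene: summarize_gene(scores) for gene, scores in cnv_table.items()}
for gene, counts in summary.items():
    print(gene, dict(counts))
```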
- Copy Number Variation > Gene Level Copy Number Scores (33 files; 474.29 Mb)
$ mkdir /mnt/dataA/TCGA/raw/CNV
$ sudo /opt/gdc-client download \
-d /mnt/dataA/TCGA/raw/CNV/ \
-m data/gdc_manifest.2019-09-26.txt
The full set of diagnostic whole-slide images can be downloaded as follows:
- Slide image > Diagnostic slide (11'766 files; 12.95 Tb)
$ mkdir /net/data/Projects/imaging_genomics/TCGA_BRCA/diagnostic_slide/
$ /net/gdc-client download \
-d /net/data/Projects/imaging_genomics/TCGA_BRCA/diagnostic_slide \
-m /net/imaging_genomics/data/gdc_manifest.2019-05-24_Diagnostic_slide.txt