DNAm_probes.tsv
README.md
clinical_data.tsv
gdc_manifest.2019-08-22.txt
gdc_manifest.2019-08-23.txt
gdc_manifest.2019-09-03_all_but_BRCA.txt
gdc_manifest.2019-09-09.txt
gdc_manifest.2019-09-26.txt
gdc_manifest_20200512_114040_example_WSIs.txt
labels.tsv
preprocess_clinical.ipynb
preprocess_omics.ipynb
preprocess_wsi.ipynb
wsi_magnifications.json

README.md

Data

All data used in the study are from The Cancer Genome Atlas (TCGA) program, which includes a rich body of imaging, clinical, and molecular data from 11,315 cases of 33 different cancer types (Weinstein et al., Nat Genet 2013). The data are made available by the National Cancer Institute (NCI) Genomic Data Commons (GDC) information system, publicly accessible at the GDC Data Portal.

The Download section below details the procedures used to download the raw data. After download, the raw data were preprocessed in preparation for modeling using the following dedicated Jupyter notebooks:

Clinical data - Jupyter notebook
Omics data - Jupyter notebook
WSIs - Jupyter notebook

Download

• Clinical data
• Gene expression
• miRNA expression
• DNA methylation
• Copy number variation
• Whole-slide images (WSI)

Clinical data were downloaded using the TCGAbiolinks R package as detailed below. All other modalities were downloaded from the GDC Data Portal for all cancer entities in the TCGA program using the GDC Data Transfer Tool (docs). The tool takes a manifest file as input, obtained according to the following general procedure:

Go to GDC Data Portal;
Navigate to Repository and check box for TCGA program in the Cases tab;
Filter the data (using the interactive pie charts or check boxes);
Use check boxes on the left to select the appropriate experimental strategies;
Push Manifest button to download file.

The generated Manifest files are stored here in the data folder. The code used to download the data from the respective manifest file is reproduced below for each data modality.
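Before starting a large transfer, it can help to check how many files a manifest lists and how much data it covers. The sketch below assumes the usual tab-separated GDC manifest layout with `id`, `filename`, `md5`, `size` and `state` columns; the function name `summarize_manifest` and the sample rows are made up for illustration and are not part of this repository.

```python
import csv
import io


def summarize_manifest(text):
    """Return (number of files, total size in bytes) for a GDC-style manifest."""
    # Assumption: tab-separated text with a header row containing a "size" column
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    n_files = 0
    total_bytes = 0
    for row in reader:
        n_files += 1
        total_bytes += int(row["size"])
    return n_files, total_bytes


# Hypothetical manifest content (IDs, file names and checksums are placeholders)
sample = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-1\tsample_a.FPKM-UQ.txt.gz\tmd5-a\t1500\treleased\n"
    "uuid-2\tsample_b.FPKM-UQ.txt.gz\tmd5-b\t2500\treleased\n"
)

n_files, total_bytes = summarize_manifest(sample)
print(f"{n_files} files, {total_bytes} bytes")
```

For a real manifest, the file contents can be read with `open(path).read()` and passed to the same function.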

Clinical data

Clinical data were accessed with the TCGAbiolinks R package using the following R code.

# Download data for all TCGA projects
project_ids <- stringr::str_subset(TCGAbiolinks::getGDCprojects()$project_id, 'TCGA')

data <- list()

for (project_id in project_ids) {
    data[[project_id]] <- TCGAbiolinks::GDCquery_clinic(project=project_id, type='clinical')
}

# Merge into single table
# (the "disease" column identifies each original table)
data <- do.call(dplyr::bind_rows, data)

# Write to file
output_path <- '/mnt/dataA/TCGA/raw/clinical_data.tsv'
readr::write_tsv(data, output_path)

Gene expression

The data are provided either as read counts or as FPKM/FPKM-UQ. FPKM is designed for within-sample gene comparisons and has fallen out of favor because, unlike TPM, the normalized gene values it produces for a sample do not add up to exactly one million, which complicates comparisons between samples. In practice, however, the deviation from one million is not dramatic and FPKM often works well enough. Given that normalizing such a large number of samples is challenging, here I will use the FPKM-UQ data.
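The normalizations discussed above can be sketched for a single sample as follows. This is an illustration under stated assumptions, not the GDC pipeline: the function names and toy numbers are made up, the upper quartile is taken over all genes passed in (the GDC definition may restrict the gene set), and TPM is included only as a reference point for the "adds up to one million" property.

```python
import statistics


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (total * l) for c, l in zip(counts, lengths_bp)]


def fpkm_uq(counts, lengths_bp):
    """FPKM variant scaled by the 75th percentile count instead of the total."""
    # Assumption: quartile computed over every gene in the input list
    upper_quartile = statistics.quantiles(counts, n=4, method="inclusive")[2]
    return [c * 1e9 / (upper_quartile * l) for c, l in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Toy sample: four genes with made-up read counts and lengths (base pairs)
counts = [100, 200, 300, 400]
lengths_bp = [1000, 2000, 1500, 4000]

print("sum of FPKM:", sum(fpkm(counts, lengths_bp)))
print("sum of TPM: ", sum(tpm(counts, lengths_bp)))
```

In this toy sample the FPKM values do not sum to one million while the TPM values do, which is the property the paragraph above refers to.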

Transcriptome profiling > RNA-seq > HTSeq - Counts (11'093 files; 2.8 Gb)

$ mkdir /mnt/dataA/TCGA/raw/RNA-seq_HTSeq_counts
$ sudo /opt/gdc-client download \
  -d /mnt/dataA/TCGA/raw/RNA-seq_HTSeq_counts/ \
  -m data/gdc_manifest.2019-08-21.txt

Transcriptome profiling > RNA-seq > HTSeq - FPKM-UQ (11'093 files; 5.78 Gb)

$ mkdir /mnt/dataA/TCGA/raw/RNA-seq_FPKM-UQ
$ sudo /opt/gdc-client download \
  -d /mnt/dataA/TCGA/raw/RNA-seq_FPKM-UQ/ \
  -m data/gdc_manifest.2019-08-23.txt

A description of the mRNA expression data analysis pipeline can be found on the docs page.

miRNA expression

The data are provided in tables including both read counts and reads per million mapped miRNA reads (RPM). RPM should be appropriate for the current project.
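As a quick reference, RPM rescales each raw count by the sample's total mapped miRNA reads. The sketch below assumes that the total is simply the sum of the counts passed in; the function name `reads_per_million` and the numbers are illustrative only.

```python
def reads_per_million(read_counts):
    """Scale raw miRNA read counts so that the sample total equals one million."""
    # Assumption: the sample total is the sum of the counts provided
    total = sum(read_counts)
    return [c * 1e6 / total for c in read_counts]


# Toy sample with three miRNAs (made-up counts)
rpm = reads_per_million([10, 30, 60])
print(rpm)
```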

See the docs page for a description of the analysis pipeline.

Transcriptome profiling > miRNA Expression Quantification (11'082 files; 557.23 Mb)

$ mkdir /mnt/dataA/TCGA/raw/miRNA-seq
$ sudo /opt/gdc-client download \
  -d /mnt/dataA/TCGA/raw/miRNA-seq/ \
  -m data/gdc_manifest.2019-08-22.txt

DNA methylation

The data are provided in tables of array results giving the level of methylation at known CpG sites. They include unique ids for the array probes and methylation Beta values, representing the ratio between the methylated array intensity and the total array intensity (values fall between 0, for lower levels of methylation, and 1, for higher levels).
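The Beta value described above can be written as a one-line ratio. The sketch below assumes that the total intensity is the sum of the methylated and unmethylated signals; some array pipelines add a small offset to the denominator, which is omitted here. The function name `beta_value` and the intensities are hypothetical.

```python
def beta_value(methylated, unmethylated):
    """Ratio of methylated intensity to total (methylated + unmethylated) intensity."""
    total = methylated + unmethylated
    if total == 0:
        # No signal at this probe: the ratio is undefined
        return float("nan")
    return methylated / total


# Made-up probe intensities
print(beta_value(300, 100))  # mostly methylated
print(beta_value(0, 50))     # unmethylated
```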

See the docs page for a description of the analysis pipeline.

DNA Methylation > Methylation Array (12'359 files; 1.4 Tb)

$ mkdir /mnt/dataA/TCGA/raw/Methylation
$ sudo /opt/gdc-client download \
  -d /mnt/dataA/TCGA/raw/Methylation/ \
  -m data/gdc_manifest.2019-09-09.txt

Copy number variation

The data are provided in a table for each of the 33 cancer entities with a column per patient and a row for each of 19'729 protein-coding genes. Copy number variation (CNV) values are represented as 0, 1, or -1 for each gene, corresponding to "neutral", "gain" or "loss", respectively.
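To make the table layout concrete, the sketch below counts gains and losses per gene in a small table of this shape. It assumes a tab-separated table whose first column is the gene identifier and whose remaining columns hold one patient each; the real files may carry additional annotation columns that would need to be skipped. Gene names, patient names and the function name `count_cnv_events` are placeholders.

```python
import csv
import io


def count_cnv_events(text):
    """Map each gene to (number of gains, number of losses) across patients."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row (gene column followed by patient columns)
    events = {}
    for row in reader:
        gene = row[0]
        values = [int(v) for v in row[1:]]
        # 1 encodes a gain, -1 a loss, 0 a neutral copy number
        events[gene] = (values.count(1), values.count(-1))
    return events


# Hypothetical table: two genes, three patients
sample = (
    "gene\tpatient_1\tpatient_2\tpatient_3\n"
    "GENE_A\t0\t1\t1\n"
    "GENE_B\t-1\t0\t-1\n"
)

print(count_cnv_events(sample))
```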

See the docs page for a description of the analysis pipeline.

Copy Number Variation > Gene Level Copy Number Scores (33 files; 474.29 Mb)

$ mkdir /mnt/dataA/TCGA/raw/CNV
$ sudo /opt/gdc-client download \
  -d /mnt/dataA/TCGA/raw/CNV/ \
  -m data/gdc_manifest.2019-09-26.txt

Whole-slide images (WSI)

The full set of diagnostic slides can be downloaded as follows:

Slide image > Diagnostic slide (11'766 files; 12.95 Tb)

$ mkdir /net/data/Projects/imaging_genomics/TCGA_BRCA/diagnostic_slide/
$ /net/gdc-client download \
  -d /net/data/Projects/imaging_genomics/TCGA_BRCA/diagnostic_slide \
  -m /net/imaging_genomics/data/gdc_manifest.2019-05-24_Diagnostic_slide.txt
