Merge pull request #160 from umccr/readme_updates
Update links in the Readme
skanwal committed Sep 3, 2024
2 parents d0f5937 + bd395a5 commit bf35b42
Showing 5 changed files with 471 additions and 34 deletions.
17 changes: 8 additions & 9 deletions README.Rmd
@@ -18,8 +18,7 @@ knitr::opts_chunk$set(

`RNAsum` is an R package that can post-process, summarise and visualise
outputs primarily from [DRAGEN RNA][dragen-rna] pipelines.
-Its main application is to complement genome-based findings from the
-[umccrise][umccrise] pipeline and to provide additional evidence for detected
+Its main application is to complement whole-genome based findings and to provide additional evidence for detected
alterations.

[dragen-rna]: <https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html>
@@ -59,7 +58,7 @@ docker pull ghcr.io/umccr/rnasum:latest
## Workflow

The pipeline consists of five main components illustrated and briefly
-described below. For more details, see [workflow.md](/workflow.md).
+described below. For more details, see [workflow.md](./inst/articles/workflow.md).

<img src="man/figures/RNAsum_workflow_updated.png" width="100%">

@@ -81,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md).
potential druggable targets.
5. The final product is an interactive HTML report with searchable tables and
plots presenting expression levels of the genes of interest. The report
-consists of several sections described [here](./articles/report_structure.md).
+consists of several sections described [here](./inst/articles/report_structure.md).

## Reference data

@@ -100,10 +99,10 @@ Depending on the tissue from which the patient's sample was taken, one of
**33 cancer datasets** from TCGA can be used as a reference cohort for comparing
expression changes in genes of interest of the patient. Additionally, 10 samples
from each of the 33 TCGA datasets were combined to create the
-**[Pan-Cancer dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**,
-and for some cohorts **[extended sets](./articles/tcga_projects_summary.md#extended-datasets)**
+**[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**,
+and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)**
are also available. All available datasets are listed in the
-**[TCGA projects summary table](./articles/tcga_projects_summary.md)**. These datasets
+**[TCGA projects summary table](./inst/articles/tcga_projects_summary.md)**. These datasets
have been processed using methods described in the
[TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data)
repository. The dataset of interest can be specified by using one of the
@@ -119,7 +118,7 @@ analytical pipelines. Moreover, TCGA data may include samples from tissue
material of lower quality and cellularity compared to samples processed using
local protocols. To address these issues, we have built a high-quality internal
reference cohort processed using the same pipelines as input data
-(see [data pre-processing](./articles/workflow.md#data-processing)).
+(see [data pre-processing](./inst/articles/workflow.md#data-processing)).

This internal reference set of **40 pancreatic cancer samples** is based on WTS
data generated at **[UMCCR](https://research.unimelb.edu.au/centre-for-cancer-research/our-research/precision-oncology-research-group)**
@@ -359,7 +358,7 @@ sections, including:
\*\* if genome-based results are available; see `--umccrise` argument

Detailed description of the report structure, including result prioritisation
-and visualisation is available [here](report_structure.md).
+and visualisation is available [here](./inst/articles/report_structure.md).

#### Results

50 changes: 25 additions & 25 deletions README.md
@@ -19,9 +19,8 @@
`RNAsum` is an R package that can post-process, summarise and visualise
outputs primarily from [DRAGEN
RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html)
-pipelines. Its main application is to complement genome-based findings
-from the [umccrise](https://github.com/umccr/umccrise) pipeline and to
-provide additional evidence for detected alterations.
+pipelines. Its main application is to complement whole-genome based
+findings and to provide additional evidence for detected alterations.

**DOCS**: <https://umccr.github.io/RNAsum>

@@ -54,7 +53,8 @@ docker pull ghcr.io/umccr/rnasum:latest
## Workflow

The pipeline consists of five main components illustrated and briefly
-described below. For more details, see [workflow.md](/workflow.md).
+described below. For more details, see
+[workflow.md](./inst/articles/workflow.md).

<img src="man/figures/RNAsum_workflow_updated.png" width="100%">

@@ -80,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md).
5. The final product is an interactive HTML report with searchable
tables and plots presenting expression levels of the genes of
interest. The report consists of several sections described
-[here](./articles/report_structure.md).
+[here](./inst/articles/report_structure.md).

## Reference data

@@ -101,12 +101,12 @@ of **33 cancer datasets** from TCGA can be used as a reference cohort
for comparing expression changes in genes of interest of the patient.
Additionally, 10 samples from each of the 33 TCGA datasets were combined
to create the **[Pan-Cancer
-dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**, and
-for some cohorts **[extended
-sets](./articles/tcga_projects_summary.md#extended-datasets)** are also
-available. All available datasets are listed in the **[TCGA projects
-summary table](./articles/tcga_projects_summary.md)**. These datasets
-have been processed using methods described in the
+dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**,
+and for some cohorts **[extended
+sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are
+also available. All available datasets are listed in the **[TCGA
+projects summary table](./inst/articles/tcga_projects_summary.md)**.
+These datasets have been processed using methods described in the
[TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data)
repository. The dataset of interest can be specified by using one of the
TCGA project IDs for the `RNAsum` `--dataset` argument (see
@@ -122,7 +122,7 @@ may include samples from tissue material of lower quality and
cellularity compared to samples processed using local protocols. To
address these issues, we have built a high-quality internal reference
cohort processed using the same pipelines as input data (see [data
-pre-processing](./articles/workflow.md#data-processing)).
+pre-processing](./inst/articles/workflow.md#data-processing)).

This internal reference set of **40 pancreatic cancer samples** is based
on WTS data generated at
@@ -170,24 +170,24 @@ quantification file.

The table below lists all input data accepted in `RNAsum`:

-| Input file | Tool | Example | Required |
-|----|----|----|----|
-| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** |
-| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** |
-| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |
-| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |
+| Input file | Tool | Example | Required |
+|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|----------|
+| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** |
+| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** |
+| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |
+| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |

### WGS

`RNAsum` is designed to be compatible with WGS outputs.

The table below lists all input data accepted in `RNAsum`:

-| Input file | Tool | Example | Required |
-|----|----|----|----|
-| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No |
-| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No |
-| **SVs** | [Manta](https://github.com/Illumina/manta) | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No |
+| Input file | Tool | Example | Required |
+|-----------------|-------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|----------|
+| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No |
+| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No |
+| **SVs** | [Manta](https://github.com/Illumina/manta) | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No |

## Usage

@@ -203,7 +203,7 @@ export PATH="${rnasum_cli}:${PATH}"
Usage
=====
-/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/RNAsum/cli/rnasum.R [options]
+/Library/Frameworks/R.framework/Versions/4.2/Resources/library/RNAsum/cli/rnasum.R [options]


Options
@@ -455,7 +455,7 @@ argument

Detailed description of the report structure, including result
prioritisation and visualisation is available
-[here](report_structure.md).
+[here](./inst/articles/report_structure.md).

#### Results

84 changes: 84 additions & 0 deletions inst/articles/TCGA_projects_summary.md
@@ -0,0 +1,84 @@
# TCGA projects summary


The table below summarises [TCGA](https://portal.gdc.cancer.gov/) expression data available for **[33 cancer types](#primary-datasets)**.

Additionally, extended sets are available for the *Bladder Urothelial Carcinoma*, *Pancreatic Adenocarcinoma* and *Lung Adenocarcinoma* cohorts (see the [Extended datasets](#extended-datasets) table). These add neuroendocrine tumour (NET), intraductal papillary mucinous neoplasm (IPMN), acinar cell carcinoma (ACC) and large-cell neuroendocrine carcinoma (LCNEC) samples.

Finally, 10 samples from each of the [33 datasets](#primary-datasets) were combined to create the [Pan-Cancer dataset](#pan-cancer-dataset).

The dataset of interest can be specified by using one of the [TCGA](https://portal.gdc.cancer.gov/) project IDs (`Project` column) for the `--dataset` argument in the *[RNAseq_report.R](./rmd_files/RNAseq_report.R)* script (see the [Arguments](./README.md#arguments) section).

###### Note

To reduce the data processing time and the size of the final HTML-based ***Patient Transcriptome Summary*** **report**, the following datasets were restricted to include expression data from 300 patients: `BRCA`, `THCA`, `HNSC`,
`LGG`, `KIRC`, `LUSC`, `LUAD`, `PRAD`, `STAD` and `LIHC`.

## Primary datasets

No | Project | Name | Tissue code\* | Samples no.\**
------------ | ------------ | ------------ | ------------ | ------------
1 | `BRCA` | Breast Invasive Carcinoma | 1 | **300**
2 | `THCA` | Thyroid Carcinoma | 1 | **300**
3 | `HNSC` | Head and Neck Squamous Cell Carcinoma | 1 | **300**
4 | `LGG` | Brain Lower Grade Glioma | 1 | **300**
5 | `KIRC` | Kidney Renal Clear Cell Carcinoma | 1 | **300**
6 | `LUSC` | Lung Squamous Cell Carcinoma | 1 | **300**
7 | `LUAD` | Lung Adenocarcinoma | 1 | **300**
8 | `PRAD` | Prostate Adenocarcinoma | 1 | **300**
9 | `STAD` | Stomach Adenocarcinoma | 1 | **300**
10 | `LIHC` | Liver Hepatocellular Carcinoma | 1 | **300**
11 | `COAD` | Colon Adenocarcinoma | 1 | **257**
12 | `KIRP` | Kidney Renal Papillary Cell Carcinoma | 1 | **252**
13 | `BLCA` | Bladder Urothelial Carcinoma | 1 | **246**
14 | `OV` | Ovarian Serous Cystadenocarcinoma | 1 | **220**
15 | `SARC` | Sarcoma | 1 | **214**
16 | `PCPG` | Pheochromocytoma and Paraganglioma | 1 | **177**
17 | `CESC` | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | 1 | **171**
18 | `UCEC` | Uterine Corpus Endometrial Carcinoma | 1 | **168**
19 | `PAAD` | Pancreatic Adenocarcinoma | 1 | **150**
20 | `TGCT` | Testicular Germ Cell Tumours | 1 | **149**
21 | `LAML` | Acute Myeloid Leukaemia | 3 | **145**
22 | `ESCA` | Esophageal Carcinoma | 1 | **142**
23 | `GBM` | Glioblastoma Multiforme | 1 | **141**
24 | `THYM` | Thymoma | 1 | **118**
25 | `SKCM` | Skin Cutaneous Melanoma | 1 | **100**
26 | `READ` | Rectum Adenocarcinoma | 1 | **87**
27 | `UVM` | Uveal Melanoma | 1 | **80**
28 | `ACC` | Adrenocortical Carcinoma | 1 | **78**
29 | `MESO` | Mesothelioma | 1 | **77**
30 | `KICH` | Kidney Chromophobe | 1 | **59**
31 | `UCS` | Uterine Carcinosarcoma | 1 | **56**
32 | `DLBC` | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 1 | **47**
33 | `CHOL` | Cholangiocarcinoma | 1 | **34**
<br />

## Extended datasets

No | Project | Name | Tissue code\* | Samples no.\**
------------ | ------------ | ------------ | ------------ | ------------
1 | `LUAD-LCNEC` | Lung Adenocarcinoma dataset including large-cell neuroendocrine carcinoma (LCNEC, n=14) | 1 | **314**
2 | `BLCA-NET` | Bladder Urothelial Carcinoma dataset including neuroendocrine tumours (NETs, n=2) | 1 | **248**
3 | `PAAD-IPMN` | Pancreatic Adenocarcinoma dataset including intraductal papillary mucinous neoplasm (IPMNs, n=2) | 1 | **152**
4 | `PAAD-NET` | Pancreatic Adenocarcinoma dataset including neuroendocrine tumours (NETs, n=8) | 1 | **158**
5 | `PAAD-ACC` | Pancreatic Adenocarcinoma dataset including acinar cell carcinoma (ACCs, n=1) | 1 | **151**
<br />
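
The sample counts in the extended table can be cross-checked against the primary table: each extended set should equal its parent cohort plus the added samples. The Python sketch below does that arithmetic. The dictionary and function names are illustrative and not part of `RNAsum`, and it assumes the added samples are simply appended to the parent cohort as listed in the primary table.

```python
# Sample counts copied from the "Primary datasets" table.
primary = {"LUAD": 300, "BLCA": 246, "PAAD": 150}

# Extended sets: (parent project, added samples n, reported total).
extended = {
    "LUAD-LCNEC": ("LUAD", 14, 314),
    "BLCA-NET": ("BLCA", 2, 248),
    "PAAD-IPMN": ("PAAD", 2, 152),
    "PAAD-NET": ("PAAD", 8, 158),
    "PAAD-ACC": ("PAAD", 1, 151),
}


def check_extended(primary, extended):
    """Return project IDs whose reported total differs from parent + added."""
    return [
        project
        for project, (parent, added, total) in extended.items()
        if primary[parent] + added != total
    ]


mismatches = check_extended(primary, extended)
print(mismatches)  # [] because every reported total matches
```

An empty list means the extended totals are consistent with the primary counts, including the 300-patient cap on `LUAD`.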

## Pan-Cancer dataset

No | Project | Name | Tissue code\* | Samples no.\**
------------ | ------------ | ------------ | ------------ | ------------
1 | `PANCAN` | Samples from all [33 cancer types](#primary-datasets), 10 samples from each | 1 and 3 (`LAML` samples only) | **330**
<br />
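
As a sketch of how the Pan-Cancer set's size follows from its definition, the Python snippet below draws 10 samples from each of 33 cohorts. The sample identifiers, the placeholder project names and the use of random sampling are assumptions made for illustration; the text above states only that 10 samples were taken from each dataset, not how they were chosen.

```python
import random


def build_pancan(cohorts, per_cohort=10, seed=0):
    """Pick `per_cohort` samples from each cohort (random choice is an assumption)."""
    rng = random.Random(seed)
    pancan = []
    for project, samples in sorted(cohorts.items()):
        pancan.extend(rng.sample(samples, per_cohort))
    return pancan


# Hypothetical cohorts: 33 placeholder projects with 34 made-up sample IDs each
# (34 is the size of the smallest primary dataset, CHOL).
cohorts = {
    f"PROJECT_{i:02d}": [f"PROJECT_{i:02d}-S{j:03d}" for j in range(34)]
    for i in range(1, 34)
}

pancan = build_pancan(cohorts)
print(len(pancan))  # 330
```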

\* Tissue codes:

Tissue code | Letter code | Definition
------------ | ------------ | ------------
1 | TP | Primary solid Tumour
3 | TB | Primary Blood Derived Cancer - Peripheral Blood
<br />

\** Each dataset was cleaned based on the quality metrics provided in the *Merged Sample Quality Annotations* file **[merged_sample_quality_annotations.tsv](http://api.gdc.cancer.gov/data/1a7d7be8-675d-4e60-a105-19d4121bdebf)** from the [TCGA PanCanAtlas initiative webpage](https://gdc.cancer.gov/about-data/publications/pancanatlas) (see the [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/tree/master/expression/README.md#data-clean-up) repository for more details).

