Merge pull request #160 from umccr/readme_updates

Update links in the Readme
umccr · Sep 3, 2024 · bf35b42
2 parents d0f5937 + bd395a5

Showing 5 changed files with 471 additions and 34 deletions.
diff --git a/README.Rmd b/README.Rmd
@@ -18,8 +18,7 @@ knitr::opts_chunk$set(
 
 `RNAsum` is an R package that can post-process, summarise and visualise
 outputs primarily from [DRAGEN RNA][dragen-rna] pipelines.
-Its main application is to complement genome-based findings from the
-[umccrise][umccrise] pipeline and to provide additional evidence for detected
+Its main application is to complement whole-genome based findings and to provide additional evidence for detected
 alterations.
 
 [dragen-rna]: <https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html>
@@ -59,7 +58,7 @@ docker pull ghcr.io/umccr/rnasum:latest
 ## Workflow
 
 The pipeline consists of five main components illustrated and briefly
-described below. For more details, see [workflow.md](/workflow.md).
+described below. For more details, see [workflow.md](./inst/articles/workflow.md).
 
 <img src="man/figures/RNAsum_workflow_updated.png" width="100%">
 
@@ -81,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md).
    potential druggable targets.
 5. The final product is an interactive HTML report with searchable tables and
    plots presenting expression levels of the genes of interest. The report
-   consists of several sections described [here](./articles/report_structure.md).
+   consists of several sections described [here](./inst/articles/report_structure.md).
 
 ## Reference data
 
@@ -100,10 +99,10 @@ Depending on the tissue from which the patient's sample was taken, one of
 **33 cancer datasets** from TCGA can be used as a reference cohort for comparing
 expression changes in genes of interest of the patient. Additionally, 10 samples
 from each of the 33 TCGA datasets were combined to create the
-**[Pan-Cancer dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**,
-and for some cohorts **[extended sets](./articles/tcga_projects_summary.md#extended-datasets)**
+**[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**,
+and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)**
 are also available. All available datasets are listed in the
-**[TCGA projects summary table](./articles/tcga_projects_summary.md)**. These datasets
+**[TCGA projects summary table](./inst/articles/tcga_projects_summary.md)**. These datasets
 have been processed using methods described in the
 [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data)
 repository. The dataset of interest can be specified by using one of the
@@ -119,7 +118,7 @@ analytical pipelines. Moreover, TCGA data may include samples from tissue
 material of lower quality and cellularity compared to samples processed using
 local protocols. To address these issues, we have built a high-quality internal
 reference cohort processed using the same pipelines as input data
-(see [data pre-processing](./articles/workflow.md#data-processing)).
+(see [data pre-processing](./inst/articles/workflow.md#data-processing)).
 
 This internal reference set of **40 pancreatic cancer samples** is based on WTS
 data generated at **[UMCCR](https://research.unimelb.edu.au/centre-for-cancer-research/our-research/precision-oncology-research-group)**
@@ -359,7 +358,7 @@ sections, including:
 \*\* if genome-based results are available; see `--umccrise` argument
 
 Detailed description of the report structure, including result prioritisation
-and visualisation is available [here](report_structure.md).
+and visualisation is available [here](./inst/articles/report_structure.md).
 
 #### Results
 

diff --git a/README.md b/README.md
@@ -19,9 +19,8 @@
 `RNAsum` is an R package that can post-process, summarise and visualise
 outputs primarily from [DRAGEN
 RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html)
-pipelines. Its main application is to complement genome-based findings
-from the [umccrise](https://github.com/umccr/umccrise) pipeline and to
-provide additional evidence for detected alterations.
+pipelines. Its main application is to complement whole-genome based
+findings and to provide additional evidence for detected alterations.
 
 **DOCS**: <https://umccr.github.io/RNAsum>
 
@@ -54,7 +53,8 @@ docker pull ghcr.io/umccr/rnasum:latest
 ## Workflow
 
 The pipeline consists of five main components illustrated and briefly
-described below. For more details, see [workflow.md](/workflow.md).
+described below. For more details, see
+[workflow.md](./inst/articles/workflow.md).
 
 <img src="man/figures/RNAsum_workflow_updated.png" width="100%">
 
@@ -80,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md).
 5.  The final product is an interactive HTML report with searchable
     tables and plots presenting expression levels of the genes of
     interest. The report consists of several sections described
-    [here](./articles/report_structure.md).
+    [here](./inst/articles/report_structure.md).
 
 ## Reference data
 
@@ -101,12 +101,12 @@ of **33 cancer datasets** from TCGA can be used as a reference cohort
 for comparing expression changes in genes of interest of the patient.
 Additionally, 10 samples from each of the 33 TCGA datasets were combined
 to create the **[Pan-Cancer
-dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**, and
-for some cohorts **[extended
-sets](./articles/tcga_projects_summary.md#extended-datasets)** are also
-available. All available datasets are listed in the **[TCGA projects
-summary table](./articles/tcga_projects_summary.md)**. These datasets
-have been processed using methods described in the
+dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**,
+and for some cohorts **[extended
+sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are
+also available. All available datasets are listed in the **[TCGA
+projects summary table](./inst/articles/tcga_projects_summary.md)**.
+These datasets have been processed using methods described in the
 [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data)
 repository. The dataset of interest can be specified by using one of the
 TCGA project IDs for the `RNAsum` `--dataset` argument (see
@@ -122,7 +122,7 @@ may include samples from tissue material of lower quality and
 cellularity compared to samples processed using local protocols. To
 address these issues, we have built a high-quality internal reference
 cohort processed using the same pipelines as input data (see [data
-pre-processing](./articles/workflow.md#data-processing)).
+pre-processing](./inst/articles/workflow.md#data-processing)).
 
 This internal reference set of **40 pancreatic cancer samples** is based
 on WTS data generated at
@@ -170,24 +170,24 @@ quantification file.
 
 The table below lists all input data accepted in `RNAsum`:
 
-| Input file | Tool | Example | Required |
-|----|----|----|----|
-| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** |
-| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** |
-| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |
-| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No |
+| Input file                           | Tool                                                                                                                                                 | Example                                                                                              | Required |
+|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|----------|
+| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf)                                          | **Yes**  |
+| Quantified gene **abundances**       | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf)                                | **Yes**  |
+| **Fusion gene** list                 | [Arriba](https://arriba.readthedocs.io/en/latest/)                                                                                                   | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final)                | No       |
+| **Fusion gene** list                 | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No       |
 
 ### WGS
 
 `RNAsum` is designed to be compatible with WGS outputs.
 
 The table below lists all input data accepted in `RNAsum`:
 
-| Input file | Tool | Example | Required |
-|----|----|----|----|
-| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No |
-| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No |
-| **SVs** | [Manta](https://github.com/Illumina/manta) | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No |
+| Input file      | Tool                                                                    | Example                                                                                                                   | Required |
+|-----------------|-------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|----------|
+| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr)                                  | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No       |
+| **CNVs**        | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv)                           | No       |
+| **SVs**         | [Manta](https://github.com/Illumina/manta)                              | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv)           | No       |
 
 ## Usage
 
@@ -203,7 +203,7 @@ export PATH="${rnasum_cli}:${PATH}"
     Usage
     =====
      
-    /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/RNAsum/cli/rnasum.R [options]
+    /Library/Frameworks/R.framework/Versions/4.2/Resources/library/RNAsum/cli/rnasum.R [options]
 
 
     Options
@@ -455,7 +455,7 @@ argument
 
 Detailed description of the report structure, including result
 prioritisation and visualisation is available
-[here](report_structure.md).
+[here](./inst/articles/report_structure.md).
 
 #### Results
 
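Because this change only rewrites relative links, a small case-sensitive link check can help confirm that each new target resolves. The sketch below is illustrative and is not part of the repository: the helper names `relative_link_targets` and `missing_links` are hypothetical, and `repo_files` is a hand-written stand-in for a real file listing. The example uses the path of the file added later in this diff, `inst/articles/TCGA_projects_summary.md`, whose capitalisation differs from the `tcga_projects_summary.md` link targets above. Only some of the 5 changed files are shown here, so a lowercase copy may exist elsewhere.

```python
import posixpath
import re

# Matches inline Markdown links such as [text](target) and captures the target.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme (https:, mailto:, ...).
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def relative_link_targets(markdown: str) -> list[str]:
    """Return normalised file targets, skipping URLs and pure #anchors."""
    targets = []
    for target in LINK_RE.findall(markdown):
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        path = target.split("#", 1)[0]  # drop the #fragment, keep the file path
        # Treat a leading "/" as repository-root relative (an assumption).
        targets.append(posixpath.normpath(path.lstrip("/")))
    return targets


def missing_links(markdown: str, repo_files: set[str]) -> list[str]:
    """Case-sensitive comparison of link targets against known repository paths."""
    return [t for t in relative_link_targets(markdown) if t not in repo_files]


readme = (
    "For more details, see [workflow.md](./inst/articles/workflow.md) and the "
    "[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)."
)
repo_files = {
    "inst/articles/workflow.md",               # assumed to exist
    "inst/articles/TCGA_projects_summary.md",  # path as added in this diff
}
print(missing_links(readme, repo_files))
```

On a case-insensitive file system the lowercase link would still open locally, which is why the comparison here is deliberately exact.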

diff --git a/inst/articles/TCGA_projects_summary.md b/inst/articles/TCGA_projects_summary.md
@@ -0,0 +1,84 @@
+# TCGA projects summary
+
+
+The table below summarises [TCGA](https://portal.gdc.cancer.gov/) expression data available for **[33 cancer types](#primary-datasets)**. 
+
+Additionally, extended sets are available for the *Bladder Urothelial Carcinoma*, *Pancreatic Adenocarcinoma* and *Lung Adenocarcinoma* cohorts (see the [Extended datasets](#extended-datasets) table). These add neuroendocrine tumour (NET), intraductal papillary mucinous neoplasm (IPMN), acinar cell carcinoma (ACC) and large-cell neuroendocrine carcinoma (LCNEC) samples.
+
+Finally, 10 samples from each of the [33 datasets](#primary-datasets) were combined to create the [Pan-Cancer dataset](#pan-cancer-dataset).
+
+The dataset of interest can be specified by passing one of the [TCGA](https://portal.gdc.cancer.gov/) project IDs (`Project` column) to the `--dataset` argument of the *[RNAseq_report.R](./rmd_files/RNAseq_report.R)* script (see the [Arguments](./README.md#arguments) section).
+
+###### Note
+
+To reduce the data processing time and the size of the final HTML-based ***Patient Transcriptome Summary*** **report**, the following datasets were restricted to include expression data from 300 patients: `BRCA`, `THCA`, `HNSC`, 
+`LGG`, `KIRC`, `LUSC`, `LUAD`, `PRAD`, `STAD` and `LIHC`.
+
+## Primary datasets
+
+No | Project | Name | Tissue code\* | Samples no.\**
+------------ | ------------ | ------------ | ------------ | ------------
+1 | `BRCA`  | Breast Invasive Carcinoma | 1 | **300**
+2 | `THCA`  | Thyroid Carcinoma | 1 | **300**
+3 | `HNSC`  | Head and Neck Squamous Cell Carcinoma | 1 | **300**
+4 | `LGG`   | Brain Lower Grade Glioma | 1 | **300**
+5 | `KIRC`  | Kidney Renal Clear Cell Carcinoma | 1 | **300**
+6 | `LUSC`  | Lung Squamous Cell Carcinoma | 1 | **300**
+7 | `LUAD`  | Lung Adenocarcinoma | 1 | **300**
+8 | `PRAD`  | Prostate Adenocarcinoma | 1 | **300**
+9 | `STAD`  | Stomach Adenocarcinoma | 1 | **300**
+10 | `LIHC`  | Liver Hepatocellular Carcinoma | 1 | **300**
+11 | `COAD`  | Colon Adenocarcinoma | 1 | **257**
+12 | `KIRP`  | Kidney Renal Papillary Cell Carcinoma | 1 | **252**
+13 | `BLCA`  | Bladder Urothelial Carcinoma | 1 | **246**
+14 | `OV`    | Ovarian Serous Cystadenocarcinoma | 1 | **220**
+15 | `SARC`  | Sarcoma | 1 | **214**
+16 | `PCPG`  | Pheochromocytoma and Paraganglioma | 1 | **177**
+17 | `CESC`  | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | 1 | **171**
+18 | `UCEC`  | Uterine Corpus Endometrial Carcinoma | 1 | **168**
+19 | `PAAD`  | Pancreatic Adenocarcinoma | 1 | **150**
+20 | `TGCT`  | Testicular Germ Cell Tumours | 1 | **149**
+21 | `LAML`  | Acute Myeloid Leukaemia | 3 | **145**
+22 | `ESCA`  | Esophageal Carcinoma | 1 | **142**
+23 | `GBM`   | Glioblastoma Multiforme | 1 | **141**
+24 | `THYM`  | Thymoma | 1 | **118**
+25 | `SKCM`  | Skin Cutaneous Melanoma | 1 | **100**
+26 | `READ`  | Rectum Adenocarcinoma | 1 | **87**
+27 | `UVM`   | Uveal Melanoma | 1 | **80**
+28 | `ACC`   | Adrenocortical Carcinoma | 1 | **78**
+29 | `MESO`  | Mesothelioma | 1 | **77**
+30 | `KICH`  | Kidney Chromophobe | 1 | **59**
+31 | `UCS`   | Uterine Carcinosarcoma | 1 | **56**
+32 | `DLBC`  | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 1 | **47**
+33 | `CHOL`  | Cholangiocarcinoma | 1 | **34**
+<br />
+
+## Extended datasets
+
+No | Project | Name | Tissue code\* | Samples no.\**
+------------ | ------------ | ------------ | ------------ | ------------
+1 | `LUAD-LCNEC`  | Lung Adenocarcinoma dataset including large-cell neuroendocrine carcinoma (LCNEC, n=14) | 1 | **314**
+2 | `BLCA-NET`  | Bladder Urothelial Carcinoma dataset including neuroendocrine tumours (NETs, n=2) | 1 | **248**
+3 | `PAAD-IPMN`  | Pancreatic Adenocarcinoma dataset including intraductal papillary mucinous neoplasm (IPMNs, n=2) | 1 | **152**
+4 | `PAAD-NET`  | Pancreatic Adenocarcinoma dataset including neuroendocrine tumours (NETs, n=8) | 1 | **158**
+5 | `PAAD-ACC`  | Pancreatic Adenocarcinoma dataset including acinar cell carcinoma (ACCs, n=1) | 1 | **151**
+<br />
+
+## Pan-Cancer dataset
+
+No | Project | Name | Tissue code\* | Samples no.\**
+------------ | ------------ | ------------ | ------------ | ------------
+1 | `PANCAN`  | Samples from all [33 cancer types](#primary-datasets), 10 samples from each  | 1 and 3 (`LAML` samples only) | **330**
+<br />
+
+\* Tissue codes:
+
+Tissue code | Letter code | Definition
+------------ | ------------ | ------------
+1 | TP  | Primary solid Tumour
+3 | TB  | Primary Blood Derived Cancer - Peripheral Blood
+<br />
+
+\** Each dataset was cleaned based on the quality metrics provided in the *Merged Sample Quality Annotations* file **[merged_sample_quality_annotations.tsv](http://api.gdc.cancer.gov/data/1a7d7be8-675d-4e60-a105-19d4121bdebf)** from the [TCGA PanCanAtlas initiative webpage](https://gdc.cancer.gov/about-data/publications/pancanatlas) (see the [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/tree/master/expression/README.md#data-clean-up) repository for more details).
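
The sample counts in the tables above can be cross-checked with a little arithmetic. The sketch below copies the numbers from the tables and assumes that each extended dataset equals its primary dataset plus the `n` additional samples named in its description. That relationship is inferred from the listed totals and is not stated explicitly in the file. The names `PRIMARY`, `EXTENDED` and `inconsistent_extended` are hypothetical.

```python
# Sample counts copied from the "Primary datasets" table.
PRIMARY = {"LUAD": 300, "BLCA": 246, "PAAD": 150}

# project ID -> (base primary dataset, added samples n, total listed in the table)
EXTENDED = {
    "LUAD-LCNEC": ("LUAD", 14, 314),
    "BLCA-NET": ("BLCA", 2, 248),
    "PAAD-IPMN": ("PAAD", 2, 152),
    "PAAD-NET": ("PAAD", 8, 158),
    "PAAD-ACC": ("PAAD", 1, 151),
}

N_PRIMARY_DATASETS = 33   # number of TCGA cancer types listed
SAMPLES_PER_DATASET = 10  # samples drawn from each one for PANCAN


def inconsistent_extended(primary, extended):
    """Return project IDs whose listed total is not base count + added samples."""
    return [
        project
        for project, (base, added, total) in extended.items()
        if primary[base] + added != total
    ]


pancan_total = N_PRIMARY_DATASETS * SAMPLES_PER_DATASET

print(inconsistent_extended(PRIMARY, EXTENDED))  # [] means every row adds up
print(pancan_total)                              # 330, matching the PANCAN row
```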
+
+