From 95c3233aac3af65c3c13f71c3e72b5d67aa4cd3a Mon Sep 17 00:00:00 2001 From: Sehrish Date: Wed, 14 Aug 2024 18:26:28 +1000 Subject: [PATCH 1/6] add supporting docx --- inst/articles/TCGA_projects_summary.md | 84 +++++++++ inst/articles/report_structure.md | 126 ++++++++++++++ inst/articles/workflow.md | 228 +++++++++++++++++++++++++ 3 files changed, 438 insertions(+) create mode 100644 inst/articles/TCGA_projects_summary.md create mode 100644 inst/articles/report_structure.md create mode 100644 inst/articles/workflow.md diff --git a/inst/articles/TCGA_projects_summary.md b/inst/articles/TCGA_projects_summary.md new file mode 100644 index 00000000..31e3127f --- /dev/null +++ b/inst/articles/TCGA_projects_summary.md @@ -0,0 +1,84 @@ +# TCGA projects summary + + +The table below summarises [TCGA](https://portal.gdc.cancer.gov/) expression data available for **[33 cancer types](#primary-datasets)**. + +Additionally, for *Bladder Urothelial Carcinoma*, *Pancreatic Adenocarcinoma* and *Lung Adenocarcinoma* cohorts extended sets are available (see [Extended datasets](#extended-datasets) table), including neuroendocrine tumours (NETs), intraductal papillary mucinous neoplasms (IPMNs), acinar cell carcinoma (ACC) samples and large-cell neuroendocrine carcinoma (LCNEC). + +Finally, 10 samples from each of the [33 datasets](#primary-datasets) were combined to create the [Pan-Cancer dataset](#pan-cancer-dataset). + +The dataset of interest can be specified by using one of the [TCGA](https://portal.gdc.cancer.gov/) project IDs (`Project` column) for the `--dataset` argument in *[RNAseq_report.R](./rmd_files/RNAseq_report.R)* script (see [Arguments](./README.md#arguments) section). + +###### Note + +To reduce the data processing time and the size of the final html-based ***Patient Transcriptome Summary*** **report** the following datasets were restricted to include expression data from 300 patients: `BRCA`, `THCA`, `HNSC`, +`LGG`, `KIRC`, `LUSC`, `LUAD`, `PRAD`, `STAD` and `LIHC`. 
+ +## Primary datasets + +No | Project | Name | Tissue code\* | Samples no.\** +------------ | ------------ | ------------ | ------------ | ------------ +1 | `BRCA` | Breast Invasive Carcinoma | 1 | **300** +2 | `THCA` | Thyroid Carcinoma | 1 | **300** +3 | `HNSC` | Head and Neck Squamous Cell Carcinoma | 1 | **300** +4 | `LGG` | Brain Lower Grade Glioma | 1 | **300** +5 | `KIRC` | Kidney Renal Clear Cell Carcinoma | 1 | **300** +6 | `LUSC` | Lung Squamous Cell Carcinoma | 1 | **300** +7 | `LUAD` | Lung Adenocarcinoma | 1 | **300** +8 | `PRAD` | Prostate Adenocarcinoma | 1 | **300** +9 | `STAD` | Stomach Adenocarcinoma | 1 | **300** +10 | `LIHC` | Liver Hepatocellular Carcinoma | 1 | **300** +11 | `COAD` | Colon Adenocarcinoma | 1 | **257** +12 | `KIRP` | Kidney Renal Papillary Cell Carcinoma | 1 | **252** +13 | `BLCA` | Bladder Urothelial Carcinoma | 1 | **246** +14 | `OV` | Ovarian Serous Cystadenocarcinoma | 1 | **220** +15 | `SARC` | Sarcoma | 1 | **214** +16 | `PCPG` | Pheochromocytoma and Paraganglioma | 1 | **177** +17 | `CESC` | Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma | 1 | **171** +18 | `UCEC` | Uterine Corpus Endometrial Carcinoma | 1 | **168** +19 | `PAAD` | Pancreatic Adenocarcinoma | 1 | **150** +20 | `TGCT` | Testicular Germ Cell Tumours | 1 | **149** +21 | `LAML` | Acute Myeloid Leukaemia | 3 | **145** +22 | `ESCA` | Esophageal Carcinoma | 1 | **142** +23 | `GBM` | Glioblastoma Multiforme | 1 | **141** +24 | `THYM` | Thymoma | 1 | **118** +25 | `SKCM` | Skin Cutaneous Melanoma | 1 | **100** +26 | `READ` | Rectum Adenocarcinoma | 1 | **87** +27 | `UVM` | Uveal Melanoma | 1 | **80** +28 | `ACC` | Adrenocortical Carcinoma | 1 | **78** +29 | `MESO` | Mesothelioma | 1 | **77** +30 | `KICH` | Kidney Chromophobe | 1 | **59** +31 | `UCS` | Uterine Carcinosarcoma | 1 | **56** +32 | `DLBC` | Lymphoid Neoplasm Diffuse Large B-cell Lymphoma | 1 | **47** +33 | `CHOL` | Cholangiocarcinoma | 1 | **34** +
+ +## Extended datasets + +No | Project | Name | Tissue code\* | Samples no.\** +------------ | ------------ | ------------ | ------------ | ------------ +1 | `LUAD-LCNEC` | Lung Adenocarcinoma dataset including large-cell neuroendocrine carcinoma (LCNEC, n=14) | 1 | **314** +2 | `BLCA-NET` | Bladder Urothelial Carcinoma dataset including neuroendocrine tumours (NETs, n=2) | 1 | **248** +3 | `PAAD-IPMN` | Pancreatic Adenocarcinoma dataset including intraductal papillary mucinous neoplasm (IPMNs, n=2) | 1 | **152** +4 | `PAAD-NET` | Pancreatic Adenocarcinoma dataset including neuroendocrine tumours (NETs, n=8) | 1 | **158** +5 | `PAAD-ACC` | Pancreatic Adenocarcinoma dataset including acinar cell carcinoma (ACCs, n=1) | 1 | **151** +
+ +## Pan-Cancer dataset + +No | Project | Name | Tissue code\* | Samples no.\** +------------ | ------------ | ------------ | ------------ | ------------ +1 | `PANCAN` | Samples from all [33 cancer types](#primary-datasets), 10 samples from each | 1 and 3 (`LAML` samples only) | **330** +
+ +\* Tissue codes: + +Tissue code | Letter code | Definition +------------ | ------------ | ------------ +1 | TP | Primary solid Tumour +3 | TB | Primary Blood Derived Cancer - Peripheral Blood +
+ +\** Each dataset was cleaned based on the quality metrics provided in the *Merged Sample Quality Annotations* file **[merged_sample_quality_annotations.tsv](http://api.gdc.cancer.gov/data/1a7d7be8-675d-4e60-a105-19d4121bdebf)** from [TCGA PanCanAtlas initiative webpage](https://gdc.cancer.gov/about-data/publications/pancanatlas) (see [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/tree/master/expression/README.md#data-clean-up) repository for more details). + + \ No newline at end of file diff --git a/inst/articles/report_structure.md b/inst/articles/report_structure.md new file mode 100644 index 00000000..48870211 --- /dev/null +++ b/inst/articles/report_structure.md @@ -0,0 +1,126 @@ +## RNAsum sections + + +* [Input data](#input-data) +* [Clinical information](#clinical-information) +* [Findings summary](#findings-summary) +* [Mutated genes](#mutated-genes) +* [Fusion genes](#fusion-genes) + * [Prioritisation](#prioritisation) + * [Filtering](#filtering) + * [Abundant transcripts](#abundant-transcripts) +* [Structural variants](#structural-variants) +* [CN altered genes](#cn-altered-genes) +* [Immune markers](#immune-markers) +* [HRD genes](#hrd-genes) +* [Cancer genes](#cancer-genes) +* [Drug matching](#drug-matching) +* [Addendum](#addendum) + + + +
+ +The **`Mutated genes`**, **`Structural variants`** and **`CN altered genes`** sections will contain information about expression levels of the mutated genes, genes located within detected structural variants (SVs) and copy-number (CN) altered regions, respectively. Genes will be ordered by increasing *variants* `TIER`, *SV* `score` and `CN` *value*, respectively, and then by decreasing absolute values in the `Patient` vs selected `dataset` column. Moreover, gene fusions detected in WTS data and reported in the **`Fusion genes`** section will be first ordered based on the evidence from genome-based data (`DNA support (gene A/B)` columns). + +*** + +### Input data + +Summary of the input data + +*** + +### Clinical information + +Treatment regimen information for patients for whom clinical information is available. + +NOTE: for confidentiality reasons, the timeline (x-axis) projecting patient’s treatment regimens (y-axis) is set to start from 1st January 2000, but the treatment lengths are preserved. + +*** + +### Findings summary + +Plot and table summarising altered genes listed across various report sections + +*** + +### Mutated genes + +mRNA expression levels of mutated genes (containing single nucleotide variants (SNVs) or insertions/deletions (indels)) measured in patient's sample and their average mRNA expression in samples from cancer patients (from [TCGA](https://portal.gdc.cancer.gov/)). This section is available only for samples with available *[umccrise](https://github.com/umccr/umccrise) results* + +*** + +### Fusion genes + +Prioritised fusion genes based on [Arriba](https://arriba.readthedocs.io/en/latest/) results and annotated with the [FusionGDB](https://ccsm.uth.edu/FusionGDB) database. If WGS results from **[umccrise](https://github.com/umccr/umccrise)** are available then fusion genes in the **`Fusion genes`** report section are ordered based on the evidence from genome-based data. 
More information about gene fusions and methods for their detection and visualisation can be found [here](./fusions/README.md). + +#### Prioritisation + +Fusion genes detected in transcriptome data are prioritised based on criteria ranked in the following order: + +1. Involvement of fusion gene(s) **detected in genomic data** (if [Structural variants](#structural-variants) results are available) +2. **Detected in transcriptome data** by [Arriba](https://arriba.readthedocs.io/en/latest/) tool +3. **Reported** fusion event according to [FusionGDB](https://ccsm.uth.edu/FusionGDB/) database +4. Decreasing number of **split reads** +5. Decreasing number of **pair reads** +6. Involvement of **cancer gene(s)** (see [Cancer genes](#cancer-genes) section) + +#### Filtering + +Fusion genes detected in transcriptome data are reported if **at least one** of the following criteria is met: + +1. Involvement of fusion gene(s) **detected in genomic data** (if [Structural variants](#structural-variants) results are available) +2. **Reported** fusion event according to [FusionGDB](https://ccsm.uth.edu/FusionGDB) database +3. Involvement of **cancer gene(s)** (see [Cancer genes](#cancer-genes) section) +4. **Split reads** > 1 +5. **Pair reads** > 1 and **split reads** > 1 + +*** + +### Structural variants + +Similar to *Mutated genes* analysis but limited to genes located within structural variants (SVs) detected by [MANTA](https://github.com/Illumina/manta) using genomic data. This section is available only for samples with available *[MANTA](https://github.com/Illumina/manta) results*. + +*** + +### CN altered genes + +Section overlaying the mRNA expression data for [cancer genes](#cancer-genes) with per-gene somatic copy-number (CN) data (from [PURPLE](https://anaconda.org/bioconda/hmftools-purple)) and mutation status, if available. + +*** + +### Immune markers + +Similar to *Mutated genes* analysis but limited to genes considered to be immune markers. 
The immune markers used in the report are listed in the PanelApp panel [Immune markers for WTS report](https://panelapp.agha.umccr.org/panels/243/). + +*** + +### HRD genes + +Similar to *Mutated genes* analysis but limited to genes considered to be homologous recombination deficiency (HRD) genes. The HRD genes used in the report are listed in the PanelApp panel [Homologous recombination deficiency (HDR) for WTS report](https://panelapp.agha.umccr.org/panels/242/). + +*** + +### Cancer genes + +Similar to the analyses above, but limited to *UMCCR cancer genes*. + +*** + +### Drug matching + +List of drugs targeting variants in detected *mutated genes*, *fusion genes*, *structural variants-affected genes*, *CN altered genes*, *HRD genes* and dysregulated *cancer genes*, which can be considered in the treatment decision-making process. + +###### Note + +This section is not displayed by default. Set the `--drugs` argument to `TRUE` to present it in the report. + +*** + +### Addendum + +Additional information, including `Parameters`, `Reporter details` and R `Session information`, added at the end of the report. + +
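As an illustration only, the reporting criteria from the *Filtering* list and the ranking from the *Prioritisation* list in the *Fusion genes* section could be sketched in Python as below. This is not the RNAsum implementation (RNAsum is an R package); the `Fusion` record and its field names are hypothetical, and criterion 2 of the prioritisation list (detection by Arriba) is assumed to hold for every candidate and is therefore not modelled.

```python
from dataclasses import dataclass


@dataclass
class Fusion:
    """Hypothetical fusion candidate; field names are illustrative only."""
    gene_a: str
    gene_b: str
    dna_support: bool   # fusion gene(s) also detected in genomic (SV) data
    reported: bool      # known fusion event, e.g. listed in FusionGDB
    cancer_gene: bool   # involves at least one cancer gene
    split_reads: int
    pair_reads: int


def passes_filter(f: Fusion) -> bool:
    """Report a fusion if at least one of the listed criteria is met."""
    return (
        f.dna_support
        or f.reported
        or f.cancer_gene
        or f.split_reads > 1
        or (f.pair_reads > 1 and f.split_reads > 1)
    )


def prioritise(fusions: list[Fusion]) -> list[Fusion]:
    """Keep reportable fusions and sort them by the ranked criteria."""
    kept = [f for f in fusions if passes_filter(f)]
    # False sorts before True, so "not flag" puts supported fusions first;
    # read counts are negated to sort in decreasing order.
    return sorted(
        kept,
        key=lambda f: (
            not f.dna_support,
            not f.reported,
            -f.split_reads,
            -f.pair_reads,
            not f.cancer_gene,
        ),
    )
```

The tuple sort key mirrors the ranked list: a later criterion only breaks ties left by the earlier ones.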
+ diff --git a/inst/articles/workflow.md b/inst/articles/workflow.md new file mode 100644 index 00000000..3787a101 --- /dev/null +++ b/inst/articles/workflow.md @@ -0,0 +1,228 @@ +## RNAsum data processing workflow + +This document describes the main workflow components involved in (**1**) *[read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* and *[gene fusions](./data/test_data/final/test_sample_WTS/arriba/fusions.tsv)* data **[collection](#1-data-collection)**, (**2**) *[read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* data **[processing](#2-data-processing)**, (**3**) **[integration](#3-integration-with-wgs-based-results)** with **[WGS](./README.md#wgs)**-based data (processed using the *[umccrise](https://github.com/umccr/umccrise)* pipeline), (**4**) results **[annotation](#4-results-annotation)** and (**5**) presentation in the *Patient Transcriptome Summary* **[report](#5-report-generation)**. + + + +
+ +## Table of contents + + +* [1. Data collection](#1-data-collection) +* [2. Data processing](#2-data-processing) + * [Counts processing](#counts-processing) + * [Data collection](#data-collection) + * [Transformation](#transformation) + * [Filtering (optional)](#filtering-optional) + * [Normalisation (optional)](#normalisation-optional) + * [Combination](#combination) + * [Batch-effects correction (optional)](#batch-effects-correction-optional) + * [Data scaling](#data-scaling) +* [3. Integration with WGS-based results](#3-integration-with-wgs-based-results) + * [Somatic SNVs and small indels](#somatic-snvs-and-small-indels) + * [Structural variants](#structural-variants) + * [Somatic CNVs](#somatic-cnvs) +* [4. Results annotation](#4-results-annotation) + * [Key cancer genes](#key-cancer-genes) + * [OncoKB](#oncokb) + * [VICC](#vicc) + * [CIViC](#civic) + * [CGI](#cgi) + * [FusionGDB](#fusiongdb) +* [5. Report generation](#5-report-generation) + + + +## 1. Data collection + +**[Read counts](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)** data from patient sample are collected from *[bcbio-nextgen RNA-seq](https://bcbio-nextgen.readthedocs.io/en/latest/contents/bulk_rnaseq.html)* or *[DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html)* pipeline. + +## 2. Data processing + +### Counts processing + +The **read count** data (see [Input data](./README.md#input-data) section in the main page) in *[abundance.tsv](./data/test_data/final/test_sample_WTS/kallisto/abundance.tsv)* or *[quant.sf](./data/test_data/stratus/test_sample_WTS/TEST.quant.sf)* quantification files from [kallisto](https://pachterlab.github.io/kallisto/about) or [salmon](https://salmon.readthedocs.io/en/latest/salmon.html), respectively, are processed following steps illustrated in [Figure 1](./img/counts_post-processing_scheme.png) and described below. 
+ + + +###### Figure 1 +>Counts processing scheme. + +#### Data collection + +([Figure 1](./img/counts_post-processing_scheme.png)A) + +* Load read count files from the following three sets of data: + + 1. patient **sample** (see [Input data](./README.md#input-data) section in the main page) + 2. **external reference** cohort ([TCGA](https://tcga-data.nci.nih.gov/), available cancer types are listed in [TCGA projects summary table](./TCGA_projects_summary.md)) corresponding to the patient cancer sample + 3. UMCCR **internal reference** set of in-house pancreatic cancer samples (regardless of the patient sample origin; see [Input data](./README.md#input-data) section in the main page) + +#### Transformation + +([Figure 1](./img/counts_post-processing_scheme.png)B) + +* Subset datasets to include common genes +* Combine patient **sample** and **internal reference** dataset +* Convert counts to **[CPM](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)** (*Counts Per Million*; default) or **[TPM](https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/)** (*Transcripts Per Kilobase Million*) values in: + 1. **sample** + **internal reference** set + 2. **external reference** set + +#### Filtering (optional) + +([Figure 1](./img/counts_post-processing_scheme.png)C) + +* Filter out genes with low counts (CPM or TPM **< 1** in more than 90% of samples) in: + 1. **sample** + **internal reference** set + 2. **external reference** set + +#### Normalisation (optional) + +([Figure 1](./img/counts_post-processing_scheme.png)D) + +* Normalise data (see [Arguments](./README.md#arguments) section in the main page for available options) for sample-specific effects in: + 1. **sample** + **internal reference** set + 2. 
**external reference** set + +#### Combination + +([Figure 1](./img/counts_post-processing_scheme.png)E) + +* Subset datasets to include common genes +* Combine **sample** + **internal reference** set with **external reference** set + +#### Batch-effects correction (optional) + +([Figure 1](./img/counts_post-processing_scheme.png)F) + +* Consider the patient **sample** + **internal reference** (regardless of the patient sample origin) as one batch (both sets processed with the same pipeline) and corresponding **[TCGA](https://tcga-data.nci.nih.gov/) dataset** as another batch. The objective is to remove data variation due to technical factors. + +#### Data scaling + +The processed count data is scaled to facilitate interpretation of expression values. The data is either scaled **[gene-wise](#gene-wise)** (Z-score transformation, default) or **[group-wise](#group-wise)** (centering). + +##### Gene-wise + +Z-scores are comparable because they express observations in multiples of the standard deviation of the given sample. The gene-wise Z-score transformation procedure is illustrated in [Figure 2](./img/Z-score_transformation_gene_wise.png) and is described below. + + + +###### Figure 2 +>Gene-wise Z-score transformation scheme. + +* Extract expression values across all samples for a given **gene** ([Figure 2](./img/Z-score_transformation_gene_wise.png)A) +* Compute **Z-scores** for individual samples (see equation in [Figure 2](./img/Z-score_transformation_gene_wise.png)B) +* Compute **median Z-scores** for ([Figure 2](./img/Z-score_transformation_gene_wise.png)C): + 1. **internal reference** set\* + 2. 
**external reference** set + +* Present patient sample **Z-score** in the context of the reference cohorts' **median Z-scores** ([Figure 2](./img/Z-score_transformation_gene_wise.png)D) + +\* used only for pancreatic cancer patients + +##### Group-wise + +The group-wise centering approach is presented in [Figure 3](./img/centering_group_wise.png) and is described below. + + + + +###### Figure 3 +>Group-wise centering scheme. + +* Extract expression values for ([Figure 3](./img/centering_group_wise.png)A): + 1. patient **sample** + 2. **internal reference** set\* + 3. **external reference** set + +* For each gene compute **median expression** value in ([Figure 3](./img/centering_group_wise.png)B): + 1. **internal reference** set\* + 2. **external reference** set + +* **Center** the median expression values for each gene in individual groups ([Figure 3](./img/centering_group_wise.png)C) +* Present patient sample **centered** expression values in the context of the reference cohorts' **centered** values ([Figure 3](./img/centering_group_wise.png)D) + +\* used only for pancreatic cancer patients + + +## 3. Integration with WGS-based results + +For patients with available [WGS](./README.md#wgs) data processed using the *[umccrise](https://github.com/umccr/umccrise)* pipeline (see ```--umccrise``` [argument](README.md/#arguments)) the expression level information for [mutated](#somatic-snvs-and-small-indels) genes or genes located within detected [structural variants](#structural-variants) (SVs) or [copy-number](#somatic-cnvs) (CN) [altered regions](#somatic-cnvs), as well as the genome-based findings, are incorporated and used as the primary source for prioritising expression profiles. 
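The subsections of this section order genes by the same two-level rule: first by a genome-based rank (variant tier, SV score or CN value), then by decreasing absolute difference between the patient sample and the reference cohort. A minimal Python sketch of that rule follows. It is an illustration under assumptions, not RNAsum's R code: the dictionary keys `rank` and `patient_vs_ref`, the function name, and the numeric encoding of the rank are all hypothetical.

```python
def order_genes(rows, rank_key="rank", diff_key="patient_vs_ref",
                rank_descending=False):
    """Order gene records by genome-based rank, then by |expression difference|.

    rows            -- list of dicts, one per gene (hypothetical structure)
    rank_descending -- set to True when a higher rank value should come first,
                       e.g. CN values for genes within gained regions
    """
    sign = -1 if rank_descending else 1
    # Ties in the rank are broken by decreasing absolute expression difference.
    return sorted(rows, key=lambda r: (sign * r[rank_key], -abs(r[diff_key])))


genes = [
    {"gene": "TP53", "rank": 2, "patient_vs_ref": -3.1},
    {"gene": "KRAS", "rank": 1, "patient_vs_ref": 0.4},
    {"gene": "BRCA2", "rank": 1, "patient_vs_ref": -2.0},
    {"gene": "MYC", "rank": 2, "patient_vs_ref": 1.5},
]
ordered = [g["gene"] for g in order_genes(genes)]
```

Using the absolute value means strongly down-regulated genes rank as high as strongly up-regulated ones within the same tier.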
+ +### Somatic SNVs and small indels + +* Check if **[PCGR](https://github.com/sigven/pcgr)** output file (see [example](./data/test_data/umccrised/test_sample_WGS/pcgr/test_sample_WGS-somatic.pcgr.snvs_indels.tiers.tsv)) is available +* **Extract** expression level **information** and genome-based findings for genes with detected genomic variants (use ```--pcgr_tier``` [argument](README.md/#arguments) to define [tier](https://sigven.github.io/pcgr/articles/variant_classification.html) threshold value) +* **Order genes** by increasing variants **[tier](https://sigven.github.io/pcgr/articles/variant_classification.html)** and then by decreasing absolute values representing the difference between expression levels in the patient sample and the corresponding reference cohort + +### Structural variants + +* Check if **[Manta](https://github.com/Illumina/manta)** output file (see [example](./data/test_data/umccrised/test_sample_WGS/structural/test_sample_WGS-sv-prioritize-manta-pass.tsv)) is available +* **Extract** expression level **information** and genome-based findings for genes located within detected SVs +* **Order genes** by increasing **[SV score](https://github.com/vladsaveliev/simple_sv_annotation)** and then by decreasing absolute values representing the difference between expression levels in the patient sample and the corresponding reference cohort +* **Compare** [gene fusions](./fusions) detected in [WTS](./README.md#wts) data ([arriba](https://arriba.readthedocs.io/en/latest/) and [pizzly](https://github.com/pmelsted/pizzly)) and [WGS](./README.md#wgs) data ([Manta](https://github.com/Illumina/manta)) +* **Prioritise** [WGS](./README.md#wgs)-supported [gene fusions](./fusions) + +### Somatic CNVs + +* Check if **[PURPLE](https://github.com/hartwigmedical/hmftools/blob/master/purple/README.md)** 
output file (see [example](./data/test_data/umccrised/test_sample_WGS/purple/test_sample_WGS.purple.gene.cnv)) is available +* **Extract** expression level **information** and genome-based findings for genes located within detected CNVs (use ```--cn_loss``` and ```--cn_gain``` [arguments](README.md/#arguments) to define CN threshold values to classify genes within lost and gained regions) +* **Order genes** by increasing (for genes within lost regions) or decreasing (for genes within gained regions) **[CN](https://github.com/umccr/umccrise/blob/master/workflow.md#somatic-cnv)** and then by decreasing absolute values representing the difference between expression levels in the patient sample and the corresponding reference cohort + +## 4. Results annotation + +[WTS](./README.md#wts)- and/or [WGS](./README.md#wgs)-based results for the altered genes are collated with **knowledge** derived from in-house resources and public **databases** (listed below) to provide an additional source of evidence for their significance, e.g. to flag variants with clinical significance or potentially druggable targets. 
+ +### Key cancer genes + +* [UMCCR key cancer genes set](https://github.com/vladsaveliev/NGS_Utils/blob/master/ngs_utils/reference_data/key_genes/make_umccr_cancer_genes.Rmd) built from several sources: + * [Cancermine](http://bionlp.bcgsc.ca/cancermine/) with at least 2 publications with at least 3 citations + * [NCG known cancer genes](http://ncg.kcl.ac.uk/) + * Tier 1 [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/census) (CGC) + * [CACAO](https://github.com/sigven/cacao) hotspot genes (curated from [ClinVar](https://www.ncbi.nlm.nih.gov/clinvar/), [CIViC](https://civicdb.org/), [Cancer Hotspots](https://www.cancerhotspots.org/)) + * At least 2 matches in the following 5 sources and 8 clinical panels: + * Cancer predisposition genes ([CPSR](https://github.com/sigven/cpsr) list) + * [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/census) (tier 2) + * AstraZeneca 300 (AZ300) + * Familial Cancer + * [OncoKB](https://oncokb.org/) annotated + * MSKC-IMPACT + * MSKC-Heme + * PMCC-CCP + * Illumina-TS500 + * TEMPUS + * Foundation One + * Foundation Heme + * Vogelstein + +* Used for extracting expression levels of cancer genes (presented in the `Cancer genes` report section) +* Used to prioritise candidate [fusion genes](./fusions) + +### OncoKB + +* [OncoKB](https://oncokb.org/cancerGenes) gene list is used to annotate altered genes across various sections in the report (annotations and URL links in `External resources` column in report `Summary tables`) + + +### VICC + +* [Variant Interpretation for Cancer Consortium](https://cancervariants.org/) (VICC) knowledgebase is used to annotate altered genes across various sections in the report (annotations and URL links in `External resources` column in report `Summary tables`) + + +### CIViC + +* The [Clinical Interpretation of Variants in Cancer](https://civicdb.org/) (CIViC) database is used to annotate altered genes across various sections in the report (annotations and URL links in `External 
resources` column in report `Summary tables`) +* Used to flag clinically actionable aberrations in the `Drug matching` report section + +### CGI + +* The [Cancer Genome Interpreter](https://www.cancergenomeinterpreter.org/biomarkers) (CGI) database is used to flag genes known to be involved in gene fusions and to prioritise candidate [fusion genes](./fusions) + +### FusionGDB + +* [FusionGDB](https://ccsm.uth.edu/FusionGDB/) database is used to flag genes known to be involved in gene fusions and to prioritise candidate [gene fusions](./fusions) + +## 5. Report generation + +The final html-based ***Patient Transcriptome Summary*** **report** contains searchable tables and interactive plots presenting expression levels of altered genes, as well as links to public resources providing an additional source of evidence for their significance. The individual **[report sections](report_structure.md)**, **[results prioritisation](report_structure.md)** and **[visualisation](report_structure.md)** are described in more detail in [report_structure.md](report_structure.md). + + From 8ea8c8c4d62783325ba5157ea33fcb743052f763 Mon Sep 17 00:00:00 2001 From: Sehrish Date: Fri, 16 Aug 2024 16:24:35 +1000 Subject: [PATCH 2/6] update references to docx --- README.Rmd | 15 +++++++-------- README.md | 48 ++++++++++++++++++++++++------------------------ 2 files changed, 31 insertions(+), 32 deletions(-) diff --git a/README.Rmd b/README.Rmd index 64d390eb..d70dfcb0 100755 --- a/README.Rmd +++ b/README.Rmd @@ -18,8 +18,7 @@ knitr::opts_chunk$set( `RNAsum` is an R package that can post-process, summarise and visualise outputs primarily from [DRAGEN RNA][dragen-rna] pipelines. -Its main application is to complement genome-based findings from the -[umccrise][umccrise] pipeline and to provide additional evidence for detected +Its main application is to complement genome-based findings and to provide additional evidence for detected alterations. 
[dragen-rna]: @@ -59,7 +58,7 @@ docker pull ghcr.io/umccr/rnasum:latest ## Workflow The pipeline consists of five main components illustrated and briefly -described below. For more details, see [workflow.md](/workflow.md). +described below. For more details, see [workflow.md](./inst/articles/workflow.md). @@ -81,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md). potential druggable targets. 5. The final product is an interactive HTML report with searchable tables and plots presenting expression levels of the genes of interest. The report - consists of several sections described [here](./articles/report_structure.md). + consists of several sections described [here](./inst/articles/report_structure.md). ## Reference data @@ -100,10 +99,10 @@ Depending on the tissue from which the patient's sample was taken, one of **33 cancer datasets** from TCGA can be used as a reference cohort for comparing expression changes in genes of interest of the patient. Additionally, 10 samples from each of the 33 TCGA datasets were combined to create the -**[Pan-Cancer dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**, -and for some cohorts **[extended sets](./articles/tcga_projects_summary.md#extended-datasets)** +**[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**, +and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are also available. All available datasets are listed in the -**[TCGA projects summary table](./articles/tcga_projects_summary.md)**. These datasets +**[TCGA projects summary table](./inst/articles/tcga_projects_summary.md)**. These datasets have been processed using methods described in the [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data) repository. The dataset of interest can be specified by using one of the @@ -119,7 +118,7 @@ analytical pipelines. 
Moreover, TCGA data may include samples from tissue material of lower quality and cellularity compared to samples processed using local protocols. To address these issues, we have built a high-quality internal reference cohort processed using the same pipelines as input data -(see [data pre-processing](./articles/workflow.md#data-processing)). +(see [data pre-processing](./inst/articles/workflow.md#data-processing)). This internal reference set of **40 pancreatic cancer samples** is based on WTS data generated at **[UMCCR](https://research.unimelb.edu.au/centre-for-cancer-research/our-research/precision-oncology-research-group)** diff --git a/README.md b/README.md index 282f598c..9fd94bbc 100755 --- a/README.md +++ b/README.md @@ -20,8 +20,7 @@ outputs primarily from [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) pipelines. Its main application is to complement genome-based findings -from the [umccrise](https://github.com/umccr/umccrise) pipeline and to -provide additional evidence for detected alterations. +and to provide additional evidence for detected alterations. **DOCS**: @@ -54,7 +53,8 @@ docker pull ghcr.io/umccr/rnasum:latest ## Workflow The pipeline consists of five main components illustrated and briefly -described below. For more details, see [workflow.md](/workflow.md). +described below. For more details, see +[workflow.md](./inst/articles/workflow.md). @@ -80,7 +80,7 @@ described below. For more details, see [workflow.md](/workflow.md). 5. The final product is an interactive HTML report with searchable tables and plots presenting expression levels of the genes of interest. The report consists of several sections described - [here](./articles/report_structure.md). + [here](./inst/articles/report_structure.md). 
## Reference data @@ -101,12 +101,12 @@ of **33 cancer datasets** from TCGA can be used as a reference cohort for comparing expression changes in genes of interest of the patient. Additionally, 10 samples from each of the 33 TCGA datasets were combined to create the **[Pan-Cancer -dataset](./articles/tcga_projects_summary.md#pan-cancer-dataset)**, and -for some cohorts **[extended -sets](./articles/tcga_projects_summary.md#extended-datasets)** are also -available. All available datasets are listed in the **[TCGA projects -summary table](./articles/tcga_projects_summary.md)**. These datasets -have been processed using methods described in the +dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**, +and for some cohorts **[extended +sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are +also available. All available datasets are listed in the **[TCGA +projects summary table](./inst/articles/tcga_projects_summary.md)**. +These datasets have been processed using methods described in the [TCGA-data-harmonization](https://github.com/umccr/TCGA-data-harmonization/blob/master/expression/README.md#gdc-counts-data) repository. The dataset of interest can be specified by using one of the TCGA project IDs for the `RNAsum` `--dataset` argument (see @@ -122,7 +122,7 @@ may include samples from tissue material of lower quality and cellularity compared to samples processed using local protocols. To address these issues, we have built a high-quality internal reference cohort processed using the same pipelines as input data (see [data -pre-processing](./articles/workflow.md#data-processing)). +pre-processing](./inst/articles/workflow.md#data-processing)). This internal reference set of **40 pancreatic cancer samples** is based on WTS data generated at @@ -170,12 +170,12 @@ quantification file. 
The table below lists all input data accepted in `RNAsum`: -| Input file | Tool | Example | Required | -|----|----|----|----| -| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** | -| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** | -| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | -| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | +| Input file | Tool | Example | Required | +|--------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|----------| +| Quantified transcript **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.sf](/inst/rawdata/test_data/dragen/TEST.quant.sf) | **Yes** | +| Quantified gene **abundances** | [salmon](https://salmon.readthedocs.io/en/latest/salmon.html) ([description](https://salmon.readthedocs.io/en/latest/file_formats.html#fileformats)) | [\*.quant.gene.sf](/inst/rawdata/test_data/dragen/TEST.quant.gene.sf) | **Yes** 
| +| **Fusion gene** list | [Arriba](https://arriba.readthedocs.io/en/latest/) | [fusions.tsv](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | +| **Fusion gene** list | [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) | [\*.fusion_candidates.final](/inst/rawdata/test_data/dragen/test_sample_WTS.fusion_candidates.final) | No | ### WGS @@ -183,11 +183,11 @@ The table below lists all input data accepted in `RNAsum`: The table below lists all input data accepted in `RNAsum`: -| Input file | Tool | Example | Required | -|----|----|----|----| -| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No | -| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No | -| **SVs** | [Manta](https://github.com/Illumina/manta) | [sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No | +| Input file | Tool | Example | Required | +|-----------------|-------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|----------| +| **SNVs/Indels** | [PCGR](https://github.com/sigven/pcgr) | [pcgr.snvs_indels.tiers.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/small_variants/pcgr.snvs_indels.tiers.tsv) | No | +| **CNVs** | [PURPLE](https://github.com/hartwigmedical/hmftools/tree/master/purple) | [purple.cnv.gene.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/purple/purple.gene.cnv) | No | +| **SVs** | [Manta](https://github.com/Illumina/manta) | 
[sv-prioritize-manta.tsv](/inst/rawdata/test_data/umccrised/test_sample_WGS/structural/sv-prioritize-manta.tsv) | No | ## Usage @@ -197,13 +197,13 @@ export PATH="${rnasum_cli}:${PATH}" ``` $ rnasum.R --version - 1.1.0 + 0.6.1 $ rnasum.R --help Usage ===== - /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/RNAsum/cli/rnasum.R [options] + /Library/Frameworks/R.framework/Versions/4.2/Resources/library/RNAsum/cli/rnasum.R [options] Options From 0f006cd1f10b0a02e2d0260d635b459682cb78d5 Mon Sep 17 00:00:00 2001 From: Sehrish Date: Fri, 16 Aug 2024 16:30:48 +1000 Subject: [PATCH 3/6] fix typo --- README.Rmd | 2 +- README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.Rmd b/README.Rmd index d70dfcb0..9ce1d262 100755 --- a/README.Rmd +++ b/README.Rmd @@ -80,7 +80,7 @@ described below. For more details, see [workflow.md](./inst/articles/workflow.md potential druggable targets. 5. The final product is an interactive HTML report with searchable tables and plots presenting expression levels of the genes of interest. The report - consists of several sections described [here](./isnt/articles/report_structure.md). + consists of several sections described [here](./inst/articles/report_structure.md). ## Reference data diff --git a/README.md b/README.md index 9fd94bbc..e047f012 100755 --- a/README.md +++ b/README.md @@ -80,7 +80,7 @@ described below. For more details, see 5. The final product is an interactive HTML report with searchable tables and plots presenting expression levels of the genes of interest. The report consists of several sections described - [here](./isnt/articles/report_structure.md). + [here](./inst/articles/report_structure.md). 
## Reference data From c290cd5c8abfba6941ebfb243ad7c8a166abdcfd Mon Sep 17 00:00:00 2001 From: Sehrish Date: Mon, 19 Aug 2024 16:31:50 +1000 Subject: [PATCH 4/6] update link to report structure --- README.Rmd | 2 +- README.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/README.Rmd b/README.Rmd index 9ce1d262..17d28f12 100755 --- a/README.Rmd +++ b/README.Rmd @@ -358,7 +358,7 @@ sections, including: \*\* if genome-based results are available; see `--umccrise` argument Detailed description of the report structure, including result prioritisation -and visualisation is available [here](report_structure.md). +and visualisation is available [here](./inst/articles/report_structure.md). #### Results diff --git a/README.md b/README.md index e047f012..b11c8f7a 100755 --- a/README.md +++ b/README.md @@ -455,7 +455,7 @@ argument Detailed description of the report structure, including result prioritisation and visualisation is available -[here](report_structure.md). +[here](./inst/articles/report_structure.md). 
#### Results From 6b58c74b50b6bd11f539337b9f52a9915f84f7f1 Mon Sep 17 00:00:00 2001 From: Sehrish Date: Thu, 29 Aug 2024 11:48:28 +1000 Subject: [PATCH 5/6] update rnasum version --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index b11c8f7a..0d364b57 100755 --- a/README.md +++ b/README.md @@ -197,7 +197,7 @@ export PATH="${rnasum_cli}:${PATH}" ``` $ rnasum.R --version - 0.6.1 + 1.1.0 $ rnasum.R --help Usage From bd395a5cd7ab45572c0c56a048b62a6f2bd365fd Mon Sep 17 00:00:00 2001 From: Sehrish Date: Tue, 3 Sep 2024 15:26:39 +1000 Subject: [PATCH 6/6] fix links --- README.Rmd | 4 ++-- README.md | 6 +++--- inst/articles/workflow.md | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/README.Rmd b/README.Rmd index 17d28f12..ffd00009 100755 --- a/README.Rmd +++ b/README.Rmd @@ -18,7 +18,7 @@ knitr::opts_chunk$set( `RNAsum` is an R package that can post-process, summarise and visualise outputs primarily from [DRAGEN RNA][dragen-rna] pipelines. -Its main application is to complement genome-based findings and to provide additional evidence for detected +Its main application is to complement whole-genome based findings and to provide additional evidence for detected alterations. [dragen-rna]: @@ -99,7 +99,7 @@ Depending on the tissue from which the patient's sample was taken, one of **33 cancer datasets** from TCGA can be used as a reference cohort for comparing expression changes in genes of interest of the patient. Additionally, 10 samples from each of the 33 TCGA datasets were combined to create the -**[Pan-Cancer dataset](./inst/articles/articles/tcga_projects_summary.md#pan-cancer-dataset)**, +**[Pan-Cancer dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**, and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are also available. 
All available datasets are listed in the **[TCGA projects summary table](./inst/articles/tcga_projects_summary.md)**. These datasets diff --git a/README.md b/README.md index 0d364b57..26ee87a9 100755 --- a/README.md +++ b/README.md @@ -19,8 +19,8 @@ `RNAsum` is an R package that can post-process, summarise and visualise outputs primarily from [DRAGEN RNA](https://sapac.illumina.com/products/by-type/informatics-products/basespace-sequence-hub/apps/edico-genome-inc-dragen-rna-pipeline.html) -pipelines. Its main application is to complement genome-based findings -and to provide additional evidence for detected alterations. +pipelines. Its main application is to complement whole-genome based +findings and to provide additional evidence for detected alterations. **DOCS**: @@ -101,7 +101,7 @@ of **33 cancer datasets** from TCGA can be used as a reference cohort for comparing expression changes in genes of interest of the patient. Additionally, 10 samples from each of the 33 TCGA datasets were combined to create the **[Pan-Cancer -dataset](./inst/articles/articles/tcga_projects_summary.md#pan-cancer-dataset)**, +dataset](./inst/articles/tcga_projects_summary.md#pan-cancer-dataset)**, and for some cohorts **[extended sets](./inst/articles/tcga_projects_summary.md#extended-datasets)** are also available. 
All available datasets are listed in the **[TCGA diff --git a/inst/articles/workflow.md b/inst/articles/workflow.md index 3787a101..24711bb8 100644 --- a/inst/articles/workflow.md +++ b/inst/articles/workflow.md @@ -176,7 +176,7 @@ For patients with available [WGS](./README.md#wgs) data processed using *[umccri ### Key cancer genes -* [UMCCR key cancer genes set](https://github.com/vladsaveliev/NGS_Utils/blob/master/ngs_utils/reference_data/key_genes/make_umccr_cancer_genes.Rmd) build of off several sources: +* [UMCCR key cancer genes set](https://github.com/umccr/NGS_Utils/blob/master/ngs_utils/reference_data/key_genes/make_umccr_cancer_genes.Rmd) build of off several sources: * [Cancermine](http://bionlp.bcgsc.ca/cancermine/) with at least 2 publication with at least 3 citations * [NCG known cancer genes](http://ncg.kcl.ac.uk/) * Tier 1 [COSMIC Cancer Gene Census](https://cancer.sanger.ac.uk/census) (CGC)