This document describes the output produced by the pipeline.
The pipeline is built using Nextflow and processes data using the following steps:
- Arriba
- EricScript
- FastQC
- Fusioncatcher
- Fusion Inspector
- fusion-report
- MultiQC
- Pizzly
- Squid
- Star-Fusion
Output directory: results/tools/Arriba
fusions.tsv
- contains fusions which pass all of Arriba's filters. It should be highly enriched for true predictions. The predictions are listed from highest to lowest confidence.
fusions.discarded.tsv
- contains all events that Arriba classified as an artifact or that are also observed in healthy tissue. This file may be useful, if one suspects that an event should be present, but was erroneously discarded by Arriba.
<sample>.pdf
- contains fusion visualization when opted for
--arriba_vis
- contains fusion visualization when opted for
Output directory: results/tools/Ericscript/tmp
fusions.results.filtered.tsv
- contains all the predicted gene fusions
Column | Description |
---|---|
GeneName1 | official gene name of 5' gene. |
GeneName2 | official gene name of 3' gene. |
chr1 | chromosome of 5' gene. |
Breakpoint1 | predicted breakpoint on 5' gene. |
strand1 | strand (-/+) of 5' gene. |
chr2 | chromosome of 3' gene. |
Breakpoint2 | predicted breakpoint on 3' gene. |
strand2 | strand (-/+) of 3' gene. |
EnsemblGene1 | Ensembl gene ID of 5' gene. |
EnsemblGene2 | Ensembl gene ID of 3' gene. |
crossingreads | the number of paired end discordant reads. |
spanningreads | the number of paired end reads spanning the junction. |
mean.insertsize | mean of insert sizes of crossing + spanning reads. |
homology | if filled, all the homologies between the fusion junction and Ensembl genes. |
fusiontype | intra-chromosomal, inter-chromosomal, read-through or CIS. |
InfoGene1 | gene information about 5' gene. |
InfoGene2 | gene information about 3' gene. |
JunctionSequence | predicted junction fusion sequence. |
GeneExpr1 | Read count based estimation of the expression level of 5' gene. |
GeneExpr2 | Read count based estimation of the expression level of 3' gene. |
GeneExpr_fused | Read count based estimation of the expression level of the predicted chimeric transcript. |
ES | Edge score. |
GJS | Genuine Junction score. |
US | Uniformity score. |
EricScore | EricScore score (adaboost classifier). |
For more info check the documentation.
FastQC gives general quality metrics about your reads. It provides information about the quality score distribution across your reads, the per base sequence content (%T/A/G/C). You get information about adapter contamination and other overrepresented sequences.
For further reading and documentation see the FastQC help.
NB: The FastQC plots displayed in the MultiQC report shows untrimmed reads. They may contain adapter sequence and potentially regions with low quality. To see how your reads look after trimming, look at the FastQC reports in the
trim_galore
directory.
Output directory: results/fastqc
sample_fastqc.html
- FastQC report, containing quality metrics for your untrimmed raw fastq files
zips/sample_fastqc.zip
- zip file containing the FastQC report, tab-delimited data file and plot images
Output directory: results/tools/Fusioncatcher
final-list_candidate-fusion-genes.txt
- contains all the predicted gene fusions
Column | Description |
---|---|
Gene_1_symbol(5end_fusion_partner) | Gene symbol of the 5' end fusion partner |
Gene_2_symbol_2(3end_fusion_partner) | Gene symbol of the 3' end fusion partner |
Gene_1_id(5end_fusion_partner) | Ensembl gene id of the 5' end fusion partner |
Gene_2_id(3end_fusion_partner) | Ensembl gene id of the 3' end fusion partner |
Exon_1_id(5end_fusion_partner) | Ensembl exon id of the 5' end fusion exon-exon junction |
Exon_2_id(3end_fusion_partner) | Ensembl exon id of the 3' end fusion exon-exon junction |
Fusion_point_for_gene_1(5end_fusion_partner) | Chromosomal position of the 5' end of fusion junction (chromosome:position:strand); 1-based coordinate |
Fusion_point_for_gene_2(3end_fusion_partner) | Chromosomal position of the 3' end of fusion junction (chromosome:position:strand); 1-based coordinate |
Spanning_pairs | Count of pairs of reads supporting the fusion (including also the multimapping reads) |
Spanning_unique_reads | Count of unique reads (i.e. unique mapping positions) mapping on the fusion junction. Shortly, here are counted all the reads which map on fusion junction minus the PCR duplicated reads. |
Longest_anchor_found | Longest anchor (hangover) found among the unique reads mapping on the fusion junction |
Fusion_finding_method | Aligning method used for mapping the reads and finding the fusion genes. Here are two methods used which are: (i) BOWTIE = only Bowtie aligner is used for mapping the reads on the genome and exon-exon fusion junctions, (ii) BOWTIE+BLAT = Bowtie aligner is used for mapping reads on the genome and BLAT is used for mapping reads for finding the fusion junction, (iii) BOWTIE+STAR = Bowtie aligner is used for mapping reads on the genome and STAR is used for mapping reads for finding the fusion junction, (iv) BOWTIE+BOWTIE2 = Bowtie aligner is used for mapping reads on the genome and Bowtie2 is used for mapping reads for finding the fusion junction. |
Fusion_sequence | The inferred fusion junction (the asterisk sign marks the junction point) |
Fusion_description | Type of the fusion gene (see the Table 2) |
Counts_of_common_mapping_reads | Count of reads mapping simultaneously on both genes which form the fusion gene. This is an indication how similar are the DNA/RNA sequences of the genes forming the fusion gene (i.e. what is their homology because highly homologous genes tend to appear show as candidate fusion genes). In case of completely different sequences of the genes involved in forming a fusion gene then here it is expected to have the value zero. |
Predicted_effect | Predicted effect of the candidate fusion gene using the annotation from Ensembl database. This is shown in format effect_gene_1/effect_gene_2, where the possible values for effect_gene_1 or effect_gene_2 are: intergenic, intronic, exonic(no-known-CDS), UTR, CDS(not-reliable-start-or-end), CDS(truncated), or CDS(complete). In case that the fusion junction for both genes is within their CDS (coding sequence) then only the values in-frame or out-of-frame will be shown. |
Predicted_fused_transcripts | All possible known fused transcripts in format ENSEMBL-TRANSCRIPT-1:POSITION-1/ENSEMBLE-TRANSCRIPT-B:POSITION-2, where are fused the sequence 1:POSITION-1 of transcript ENSEMBL-TRANSCRIPT-1 with sequence POSITION-2:END of transcript ENSEMBL-TRANSCRIPT-2 |
Predicted_fused_proteins | Predicted amino acid sequences of all possible fused proteins (separated by ";"). |
For more info check the documentation.
Output directory: results/tools/FusionInspector
finspector.fa
- the candidate fusion-gene contigs (if you copy things elsewhere, make sure to also copy the index file:
finspector.fa.fai
)
- the candidate fusion-gene contigs (if you copy things elsewhere, make sure to also copy the index file:
finspector.bed
- the reference gene structure annotations for fusion partners
finspector.junction_reads.bam
- alignments of the breakpoint-junction supporting reads.
finspector.spanning_reads.bam
- alignments of the breakpoint-spanning paired-end reads.
To visualize fusion genes in IGV tool first create a genome Menu->Genomes->Create .genome File
, choose name and description, then choose the following files:
finspector.fa
- make sure the index file finspector.fa.fai is in the same folder
finspector.gtf
- use this for 'Genes'
cytoBand.txt
- use this for 'optional Cytoband'
Add the bam files by choosing File->Load from File
and make sure to select your generated mini genome in the upper-left corner.
For more info and help check wiki page.
Output directory: results/Report-<READS_BASE_NAME>
fusions.json
- contains all main information about found fusions (fusion name, score, explanation of the score calculation, cherry picked output from fusion tools)
index.html
- main dashboard containing the list of all detected fusions
*.html
- each fusion gets a custom page with fetched data from the local database
fusions_list_filtered.txt
- filtered list of found fusions (uses tool cutoff as filter, by default: 2, can be adjusted by adding
-t <num>
when running the tool)
- filtered list of found fusions (uses tool cutoff as filter, by default: 2, can be adjusted by adding
fusions_list.txt
- unfiltered list of found fusions
Graphs displaying ratio of fusion genes caught by different tools. The last part all tools is an intersection of all tools.
Displays how many fusions were found in a downloaded databases of the summary report.
For each fusion a sum of detected tools is calculated. This counts are then visualized in the graph below.
MultiQC is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in within the report data directory.
The pipeline has special steps which allow the software versions used to be reported in the MultiQC output for future traceability.
Output directory: results/multiqc
Project_multiqc_report.html
- MultiQC report - a standalone HTML file that can be viewed in your web browser
Project_multiqc_data/
- Directory containing parsed statistics from the different tools used in the pipeline
For more information about how to use MultiQC reports, see http://multiqc.info
Output directory: results/tools/Pizzly
pizzly_fusions.json
- contains all the predicted gene fusions
Column | Description |
---|---|
geneA | id : reference id and name : gene name |
geneB | Describes reference id and gene name |
paircount | Number of paired count |
splitcount | Number of split count |
transcripts | List of all transcripts fasta_record , transcriptA , transcriptB , support , reads |
readpairs | List of read pairs containing (type , read1 , read2 ) |
For more info check the documentation.
Output directory: results/tools/Squid
fusions_annotated.txt
- contains all the predicted gene fusions
Column | Description |
---|---|
chr1 | chromosome name of the first breakpoint. |
start1 | starting position of the segment of the first breakpoint, or the predicted breakpoint position if strand1 is "-" |
end1 | ending position of the segment of the first breakpoint, or the predicted breakpoint position if strand1 is "+" |
chr2 | chromosome name of the second breakpoint |
start2 | starting position of the segment of the second breakpoint, or the predicted breakpoint position if strand2 is "-" |
end2 | ending position of the segment of the second breakpoint, or the predicted breakpoint position if strand2 is "+" |
name | TSV is not named yet, this column shows with dot. |
score | number of reads supporting this TSV (without weighted by Discordant edge ratio multiplier) |
strand1 | strand of the first segment in TSV. |
strand2 | strand of the second segment in TSV. |
num_concordantfrag_bp1 | number of concordant paired-end reads covering the first breakpoint. For a concordant paired-end read, it includes two ends and a inserted region in between, if any of the 3 regions covers the breakpoint, the read is counted in this number |
num_concordantfrag_bp2 | number of concordant paired-end reads covering the second breakpoint. The count is defined in the same way as num_concordantfrag_bp1 |
For more info check the documentation.
Output directory: results/tools/StarFusion
star-fusion.fusion_predictions.tsv
- contains all the predicted gene fusions
Column | Description |
---|---|
JunctionReadCount | Indicates the number of RNA-Seq fragments containing a read that aligns as a split read at the site of the putative fusion junction. |
SpanningFragCount | Indicates the number of RNA-Seq fragments that encompass the fusion junction such that one read of the pair aligns to a different gene than the other paired-end read of that fragment. |
SpliceType | Indicates whether the proposed breakpoint occurs at reference exon junctions as provided by the reference transcript structure annotations (ex. gencode). |
LeftGene | |
LeftBreakpoint | |
RightGene | |
RightBreakpoint | |
LargeAnchorSupport | column indicates whether there are split reads that provide 'long' (set to length of 25 bases) alignments on both sides of the putative breakpoint. |
FFPM | fusion fragments per million total reads; Default: 0.1 (meaning at least 1 fusion-supporting rna-seq fragment per 10M total reads); TL;DR: can be adjusted by changing --min_FFPM |
LeftBreakDinuc | |
LeftBreakEntropy | Represents Shannon entropy |
RightBreakDinuc | |
RightBreakEntropy | Represents Shannon entropy |
annots | Annotation generated by FusionAnnotar |
For more info check the documentation.