Home
NanoFG is a fusion gene detection pipeline for Oxford Nanopore sequencing data. NanoFG uses the ENSEMBL database to find structural variations (SVs) that produce fusions between two genes. It remaps these SVs using LAST to increase breakpoint accuracy and reports the resulting fusions. By default, it produces 4 output files:
- .vcf file containing all candidate fusion genes
- .txt file containing information on all correct fusion genes
- .pdf file containing a visual overview of the detected fusion genes
- .primers text file containing primers for fusion validation
Samtools (1.7) - http://samtools.sourceforge.net/
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Minimap2 (2.6) - https://github.com/lh3/minimap2
Li, H. (2018). Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics, 34:3094-3100.
NanoSV (1.2.4) - https://github.com/mroosmalen/nanosv
Cretu Stancu, M. et al. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat. Commun. 8, 1326 (2017).
LAST (921) - http://last.cbrc.jp/doc/last.html
Kielbasa, S. M., Wan, R., Sato, K., Horton, P. & Frith, M. C. Adaptive seeds tame genomic sequence comparison. Genome Research 21, 487–493 (2011).
Wtdbg2 (2.2) - https://github.com/ruanjue/wtdbg2
Ruan, J. and Li, H. (2019) Fast and accurate long-read assembly with wtdbg2. Nat Methods
- Download NanoFG from github
- From the NanoFG directory, run:
virtualenv venv -p </path/to/python>
. venv/bin/activate
pip install -r requirements.txt
bash NanoFG.sh -f </path/to/fastq> [-n sample_name ] [-s selection] [-cc] [-df] [-dc]
or
bash NanoFG.sh -b </path/to/bam> [-v </path/to/vcf>] [-n sample_name ] [-s selection] [-cc] [-df] [-dc]
NanoFG currently only accepts a human reference fasta in which each sequence name is the chromosome number, without a "chr" prefix:
>1 (Instead of Chr1)
NNNNN
>2
NNNNN
># etc.
Creation can be done by downloading the reference from:
and running
sed 's/>chr/>/' </PATH/TO/REFERENCE_FASTA> > </PATH/TO/RESULT_FASTA>
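The header conversion can also be sketched in Python, for example to check a few lines before converting a whole reference. This is a minimal illustration, not part of NanoFG: the function name `strip_chr_prefix` is hypothetical, and unlike the sed command the pattern is anchored to the start of the line.

```python
import re

def strip_chr_prefix(lines):
    """Yield FASTA lines with a leading 'chr' removed from header names."""
    for line in lines:
        # Only header lines that start with '>chr' change; sequence lines pass through.
        yield re.sub(r"^>chr", ">", line)

# Toy input standing in for the first lines of a reference fasta.
example = [">chr1", "NNNNN", ">chr2", "NNNNN"]
converted = list(strip_chr_prefix(example))
print(converted)
```

Headers that already lack the prefix are left unchanged, so running the conversion twice is harmless.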
-f | --fastq
Path to fastq
-b | --bam
Path to bamfile
-v | --vcf
Path to vcf
-n | --name
Name of the sample to give to output files
-cf | --complex_fusion
If activated, NanoFG links together SVs that occur on the same read.
This makes it possible to find fusions where a small SV in between the two genes would otherwise prevent NanoFG from detecting the fusion.
-s | --selection
Regions to select from the bamfiles (separated by ',')
Accepted formats:
- Direct region (e.g. 17:7565097-7590856 )
- Ensembl identifier (e.g. ENSG00000141510)
- Common gene name (e.g. TP53)
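To illustrate the three accepted formats, the sketch below classifies each comma-separated entry of a selection string. It is an assumption about how such input could be told apart, not NanoFG's own parsing code; `classify_selection` and both regular expressions are hypothetical.

```python
import re

# Hypothetical patterns: 'chrom:start-end' and human Ensembl gene IDs (ENSG + 11 digits).
REGION_RE = re.compile(r"^([^:\s]+):(\d+)-(\d+)$")
ENSEMBL_GENE_RE = re.compile(r"^ENSG\d{11}$")

def classify_selection(selection):
    """Split a ','-separated selection and label each entry by its format."""
    results = []
    for item in selection.split(","):
        item = item.strip()
        if not item:
            continue
        match = REGION_RE.match(item)
        if match:
            region = (match.group(1), int(match.group(2)), int(match.group(3)))
            results.append(("region", region))
        elif ENSEMBL_GENE_RE.match(item):
            results.append(("ensembl_id", item))
        else:
            # Anything else is treated as a common gene name.
            results.append(("gene_name", item))
    return results

parsed = classify_selection("17:7565097-7590856,ENSG00000141510,TP53")
print(parsed)
```

Treating every unmatched entry as a gene name keeps the sketch short; a real implementation would presumably validate the name against the ENSEMBL database.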
-cc | --consensus_calling
Creates a consensus of all supporting reads for a breakpoint before calling fusions.
Increases the accuracy of breakpoint detection, which is especially important for exon-exon fusions.
Only activate if there is sufficient coverage to create a consensus.
-df | --dont_filter
When activated, NanoFG does not filter breakpoints before and during its steps.
-dc | --dont_clean
When activated, NanoFG does not remove any intermediate files created during its process.
Important if you want to keep the consensus sequences of the fusion gene after running NanoFG.
-wl | --without_last
When activated, minimap2 is used instead of LAST for remapping fusion candidates after optional consensus creation and complex fusion detection.
LAST was used previously because it showed more accurate read mapping than minimap2.
However, newer versions of minimap2 have shown similar accuracy, with a massive increase in speed and without the need for additional files.
Mapping of the reads in the fastq file with minimap2, using the default settings '-x map-ont -a --MD'
Selection of regions from BAM file with samtools if parameter -s|--selection is given
First, 'samtools view -L region_bedfile BAM' is used to select the names of all reads that span a given region. These read names are then used to select all alignments (primary and supplementary) of reads that are partly located in the selected region, using 'samtools view BAM | grep -f file_with_read_names'.
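The two-step selection can be sketched on toy data. The records below are plain dictionaries standing in for BAM alignments, the coordinates are treated as half-open intervals, and `select_reads` is a hypothetical helper rather than NanoFG code.

```python
def select_reads(alignments, region):
    """Keep every alignment of any read that overlaps the region at least once."""
    chrom, start, end = region
    # Step 1: names of reads with an alignment overlapping the region
    # (the role of 'samtools view -L').
    names = {
        a["name"]
        for a in alignments
        if a["chrom"] == chrom and a["start"] < end and a["end"] > start
    }
    # Step 2: all alignments carrying one of those names, wherever they map
    # (the role of 'grep -f file_with_read_names').
    return [a for a in alignments if a["name"] in names]

alignments = [
    {"name": "read1", "chrom": "17", "start": 7570000, "end": 7575000},  # inside the region
    {"name": "read1", "chrom": "9", "start": 100000, "end": 104000},     # supplementary, elsewhere
    {"name": "read2", "chrom": "3", "start": 5000, "end": 9000},         # unrelated read
]
selected = select_reads(alignments, ("17", 7565097, 7590856))
print(selected)
```

Selecting by read name rather than by position is what keeps the supplementary alignment on chromosome 9, which is the part of the read that could reveal a fusion partner.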
By default, NanoSV is used to detect SVs from the minimap2 mapped reads.
Using the .vcf file created by NanoSV, the ENSEMBL database is used to annotate all breakpoints with overlapping genes. If a breakpoint overlaps with 2 different genes, it is flagged as a possible fusion gene. Using pysam, all reads that support the breakpoint are extracted.
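The flagging rule can be sketched as follows. The gene table is toy data with hypothetical names and coordinates, standing in for an ENSEMBL lookup, and both functions are illustrative assumptions rather than NanoFG's implementation.

```python
def genes_at(chrom, pos, genes):
    """Return the names of all genes overlapping a single position."""
    return {
        g["name"]
        for g in genes
        if g["chrom"] == chrom and g["start"] <= pos <= g["end"]
    }

def is_candidate_fusion(bp1, bp2, genes):
    """Flag an SV whose two break-ends fall in two different genes."""
    genes1 = genes_at(bp1[0], bp1[1], genes)
    genes2 = genes_at(bp2[0], bp2[1], genes)
    return any(a != b for a in genes1 for b in genes2)

# Hypothetical gene annotation; real coordinates would come from ENSEMBL.
genes = [
    {"name": "GENE_A", "chrom": "1", "start": 1000, "end": 5000},
    {"name": "GENE_B", "chrom": "2", "start": 20000, "end": 30000},
]
print(is_candidate_fusion(("1", 3000), ("2", 25000), genes))
print(is_candidate_fusion(("1", 3000), ("1", 4000), genes))
```

An SV with both break-ends inside the same gene, or with one break-end outside any gene, is not flagged in this sketch.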
Perform consensus calling on the extracted reads for every SV if parameter -cc | --consensus_calling is given
Consensus calling is done by wtdbg2 using the parameters '-x ont -g 3g'
All extracted reads are mapped again with LAST, as LAST has previously been shown to produce a slightly more accurate breakpoint position than minimap2.
SV calling is performed by NanoSV.
Perform complex fusion detection if parameter -cf|--complex_fusion is given. Multiple breakpoints that occur on the same read are linked to produce a representation of that area of the genome. The first and last break-end in the read are then reported as an additional SV, making it possible to find a complex fusion gene where small SVs at the fusion breakpoint would otherwise prevent NanoFG from detecting the fusion.
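The linking step can be sketched on toy data. The record layout (`read`, `read_pos`, `left`, `right`) and the function `link_complex_svs` are assumptions made for illustration and do not reflect NanoFG's internal data structures.

```python
from collections import defaultdict

def link_complex_svs(breakpoints):
    """For reads carrying several SVs, join the first and last break-end into one extra SV."""
    by_read = defaultdict(list)
    for bp in breakpoints:
        by_read[bp["read"]].append(bp)

    linked = []
    for read, bps in by_read.items():
        if len(bps) < 2:
            continue  # a single SV on a read needs no linking
        # Order the SVs by where they occur along the read.
        ordered = sorted(bps, key=lambda bp: bp["read_pos"])
        linked.append(
            {"read": read, "left": ordered[0]["left"], "right": ordered[-1]["right"]}
        )
    return linked

breakpoints = [
    {"read": "r1", "read_pos": 700, "left": ("1", 1600), "right": ("2", 5000)},
    {"read": "r1", "read_pos": 200, "left": ("1", 1000), "right": ("1", 1500)},
    {"read": "r2", "read_pos": 300, "left": ("4", 800), "right": ("4", 900)},
]
linked = link_complex_svs(breakpoints)
print(linked)
```

In the toy input, the small SV on chromosome 1 sits between the two ends of the rearrangement; the linked SV joins position 1000 on chromosome 1 directly to position 5000 on chromosome 2.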
Any SV that can produce a correct fusion is identified and additionally flagged using information from ENSEMBL and NanoSV, and a pdf overview of all the fusions in the sample is produced.
NanoFG produces a default of 4 output files:
- .vcf file containing all candidate fusion genes
- .txt file containing information on all correct fusion genes
- .pdf file containing a visual overview of the detected fusion genes
- .primers text file containing primers for fusion validation
Multiple settings can affect NanoFG's ability to detect fusion genes:
- In the NanoSV config files (in NanoFG/files/), the minimum number of supporting reads needed to detect an SV (cluster_count) is set to 2. With very low coverage, changing this to 1 might allow NanoFG to detect these fusions, but might also increase the false-positive rate.
- If an SV is located in a hard-to-map area, the breakpoint might be reported at different locations within that region. By default, the maximum distance at which NanoSV considers two breakpoints similar is 100 (cluster_distance). Increasing this might lead to the detection of new SVs.
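The interplay of the two settings can be illustrated with a deliberately simplified one-dimensional sketch. NanoSV's actual clustering is more involved; `cluster_breakpoints` is a hypothetical function that only borrows the parameter names and default values mentioned above.

```python
def cluster_breakpoints(positions, cluster_distance=100, cluster_count=2):
    """Group nearby breakpoint positions and keep groups with enough supporting reads."""
    clusters = []
    for pos in sorted(positions):
        # Extend the current cluster if this position is close enough to its last member.
        if clusters and pos - clusters[-1][-1] <= cluster_distance:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    # Each position represents one supporting read.
    return [c for c in clusters if len(c) >= cluster_count]

# Toy breakpoint positions, one per supporting read.
positions = [1000, 1040, 1150, 5000]
print(cluster_breakpoints(positions))
print(cluster_breakpoints(positions, cluster_distance=150))
print(cluster_breakpoints(positions, cluster_count=1))
```

With the defaults, only the two closest positions form a reported cluster. Raising cluster_distance merges the third position into it, and lowering cluster_count to 1 also reports the isolated positions, which is where additional false positives would come from.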