FAQ and Troubleshooting
For long-read PacBio or Nanopore data, use the dedicated long-read pipeline, JAFFAL:
bpipe run <path to JAFFA>/JAFFAL.groovy <.fasta or fastq.gz files>
Working within the directory where JAFFA is installed:
- Download the relevant reference tarball from the Download page and untar it, e.g.
tar -xvf JAFFA_REFERENCE_FILES_HG19_GENCODE19.tar.gz
- Download the corresponding reference genome file from UCSC for hg19 or mm10 and unzip it e.g.
gunzip hg19.fa.gz
- Create blast reference files by running the following command (e.g. for hg19)
tools/ncbi-blast-2.9.0+/bin/makeblastdb -in hg19_genCode19.fa -dbtype nucl -out hg19_genCode19_blast
Reference file installation is complete. You can then either:
- Edit the genome and annotation fields in JAFFA_stages.groovy, or
- Provide the same details at run time with the bpipe options
-p genome=<genome> -p annotation=genCode<version>
Here is an example for mouse:
bpipe run -p genome=mm10 -p annotation=genCodeVM4 <path to JAFFA>/JAFFA_direct.groovy <fastq.gz files>
If your favourite genome is not currently supported, email us and we'll look into providing the reference files, or follow the instructions below to generate them yourself.
If you want to search for fusions using a genome other than hg19, hg38 or mm10 follow the instructions below. This will only work for genomes supported by UCSC and requires you to have bedtools installed. We'll demonstrate using hg19 and the gencode annotation as an example. All files should be placed in the root directory of JAFFA.
- Download the genome from UCSC if you don't already have it. JAFFA expects the UCSC version of the genome, in a single fasta file. So if you downloaded one file for each chromosome, you'll need to unzip and untar, then combine all the chromosomal fasta files together, e.g. cat chr*.fa > hg19.fa.
- Download the annotation from the UCSC table browser. Select the genome and annotation of interest (e.g. hg19 and GENCODE Genes V19). Note that the annotation needs to be Gencode or Ensembl, as JAFFA expects gene and transcript names to be prefixed with "EN". You will need to download the annotation in three different output formats:
  - all fields from selected table. Call this file <genome>_<annotation>.tab (e.g. hg19_genCode19.tab)
  - sequence. Call this file <genome>_<annotation>.fasta (e.g. hg19_genCode19.fasta). Select "genomic" if you are asked for the sequence type and make sure "introns" is not ticked in the retrieval region.
  - BED. Call this file <genome>_<annotation>.bed (e.g. hg19_genCode19.bed). You should select one bed record per exon if asked.
- Create a gene masked version of the genome using the bed annotation file and bedtools.
bedtools maskfasta -fi hg19.fa -fo Masked_hg19.fa -bed hg19_genCode19.bed
- Fix the sequence fasta file. Unfortunately, the spaces in the sequence IDs cause some problems for JAFFA, so we need to replace these with a double underscore. We also want the fasta file formatted with each sequence on a single line. Note that the correctly formatted file has the extension .fa rather than .fasta.
<JAFFA directory>/tools/bin/reformat fastawrap=0 in=hg19_genCode19.fasta out=stdout.fa | sed 's/ /__/g' > hg19_genCode19.fa
- Build references for bowtie2. Index the gene sequences and masked genome files for bowtie2
bowtie2-build hg19_genCode19.fa hg19_genCode19
bowtie2-build Masked_hg19.fa Masked_hg19
- Create references for blast. Run the following command (e.g. for hg19)
tools/ncbi-blast-2.9.0+/bin/makeblastdb -in hg19_genCode19.fa -dbtype nucl -out hg19_genCode19_blast
- Add a file of known fusions. For human, the fusions found in your dataset will be compared against this list. It is not really relevant for other genomes, but JAFFA expects the file in order to run correctly. You can either copy the known_fusions.txt file from one of the provided reference datasets or create an empty file with
touch known_fusions.txt
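The "Fix the sequence fasta file" step can also be sketched with standard tools only, which may help when checking what the reformat and sed command is expected to produce. This is a minimal sketch, not JAFFA's own method: the function name unwrap_and_fix_ids is hypothetical, and awk stands in here for the reformat tool shipped with JAFFA.

```shell
# Hypothetical helper: put each sequence on a single line, then replace
# every space in the sequence IDs with a double underscore.
unwrap_and_fix_ids() {
  awk '/^>/ { if (seq != "") print seq; seq = ""; print; next }  # header: flush previous sequence
       { seq = seq $0 }                                          # sequence line: append
       END { if (seq != "") print seq }' "$1" | sed 's/ /__/g'
}
```

For example, unwrap_and_fix_ids hg19_genCode19.fasta > hg19_genCode19.fa would write the result under the .fa name used in the steps above.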
You are now ready to go. Remember to specify the genome and annotation to JAFFA. JAFFA assumes that you have used consistent file naming like the examples above (e.g. hg19_genCode19.fa or Masked_hg19.fa).
bpipe run -p genome=hg19 -p annotation=genCode19 <path to JAFFA>/JAFFA_direct.groovy <fastq.gz files>
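Because JAFFA relies on consistent file naming, it can be worth checking the names before a run. The sketch below only looks for the files named in the steps on this page; the function name check_jaffa_refs is hypothetical, and JAFFA itself may need only a subset of these files, or further index files (bowtie2, blast) that are not checked here.

```shell
# Hypothetical helper: report which of the reference files named on this
# page are missing from a directory. Returns 1 if any file is missing.
check_jaffa_refs() {
  dir=$1; genome=$2; annotation=$3
  prefix="${genome}_${annotation}"
  missing=0
  for f in "$genome.fa" "Masked_$genome.fa" "$prefix.fa" "$prefix.tab" "$prefix.bed" known_fusions.txt; do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

For the example above, this would be called as check_jaffa_refs <path to JAFFA> hg19 genCode19.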
Run time and memory use depend a lot on the depth of data (e.g. the number of bases sequenced) and on which JAFFA mode was run. The Direct mode is the least computationally expensive. Assembly and Hybrid modes can use much more memory and take longer to run because of the de novo assembly step.
Version 2 of JAFFA will process an average sized RNA-Seq sample in a few hours using around 20 GB of RAM (with multiple threads).
Version 1 is single threaded and significantly slower than version 2. Some examples of the range to expect for version 1 (based on datasets we've run on):
Direct mode on 100bp paired-end reads
- Reads: 2.5-40 million
- RAM: Around 10 GB or less
- Time: 1-20 CPU hours (single thread)
Assembly mode on 50bp paired-end reads
- Reads: 14-42 million
- RAM: 5.5-18 GB
- Time: 2.5-6 hours
Hybrid mode on 100bp paired-end reads
- Reads: 2.5-40 million
- RAM: 4-75 GB
- Time: 4-40 CPU hours
You can restrict the memory used by JAFFA through the bpipe option --memory <n>GB. If the memory limit is exceeded (most likely during assembly), the run will be killed by bpipe.
- From the file, jaffa_results.csv, determine the "sample", "contig" and "contig break" of the fusion of interest. The contig must start with "Locus..", i.e. be an Oases assembled contig (and not a read ID).
- Load IGV
- In IGV click on "Load Genome from File" and then navigate to the directory called <sample>. Click on the file <sample>.fusions.fa
- In the drop-down box which usually contains the chromosome names (or using the text box next to it), put in the name of <contig>
- Load the reads by clicking "Load from File". In the directory <sample>, select the file named <sample>.sorted.bam
- To highlight the breakpoint, go to "Regions", "Regions Navigator" and put in <contig break> as the start and <contig break> +1 as the end. Then click "View".
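The first step, picking out the values to enter in IGV, can be sketched on the command line. This is a rough sketch under stated assumptions: jaffa_results.csv has a header row containing the column names "sample", "contig" and "contig break" exactly as quoted above, and no field contains an embedded comma. The function name list_fusion_contigs is hypothetical.

```shell
# Hypothetical helper: for each assembled contig (name starting with
# "Locus"), print sample, contig, region start (the contig break) and
# region end (contig break + 1), separated by tabs.
list_fusion_contigs() {
  awk -F',' -v OFS='\t' '
    { gsub(/"/, "") }                                          # drop quotes around fields
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }    # map header names to positions
    $(col["contig"]) ~ /^Locus/ {
      brk = $(col["contig break"])
      print $(col["sample"]), $(col["contig"]), brk, brk + 1
    }' "$1"
}
```

Rows whose contig is a read ID rather than an assembled contig are skipped, matching the requirement in the first step.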
JAFFA stops with an error like this:
Error in `$<-.data.frame`(`*tmp*`, known, value = "-") :
replacement has 1 row, data has 0
Calls: $<- -> $<-.data.frame
This is most often caused by having an incorrect genome reference fasta file. The genome reference fasta file should be created using the instructions described under point 3 at https://github.com/Oshlack/JAFFA/wiki/HowToSetUpJAFFA#installing. This file is not the same as the transcriptome reference file which you can download from the JAFFA wiki. If this is the cause of your error, install the correct fasta, remove the file which ends with _genome.psl and rerun JAFFA.
Less commonly, this error can occur if JAFFA finds no fusions in the sample. This is unlikely for a regular depth sequencing dataset, as even normal tissue produces trans-splicing calls. However, this might happen for very low depth, simulated or high error rate datasets.
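If the genome fasta was the problem, the stale alignment output has to be removed before rerunning. A minimal sketch, assuming the file sits somewhere under the directory you pass in; the function name remove_genome_psl is hypothetical.

```shell
# Hypothetical helper: print and delete every file ending in _genome.psl
# below the given directory, so that JAFFA regenerates it on the next run.
remove_genome_psl() {
  find "$1" -type f -name '*_genome.psl' -print -exec rm -f -- {} +
}
```

Other .psl files are left untouched because the name pattern only matches the _genome.psl suffix.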
Check the tile size limit of the version of BLAT you have installed. JAFFA assumes a tile size up to 18, but some versions of BLAT (e.g. v. 35) do not support tile sizes this large. You can get around this easily by running with the option "-p contigTile=X -p readTile=X", replacing X with the maximum supported tile size. The tile size does not affect the accuracy of results in most cases, but a larger size will make the alignment steps faster.
JAFFA is not properly parallelising the samples or you get an error like "The pattern provided %_*.fastq.gz did not match any of the files provided as input [checks]"
You may need to change the value of the variable fastqInputFormat, which is set to %_*.fastq.gz by default. What this does is to search for files of the form %_*.fastq.gz in your input list. Both the % and * are wildcards (like the '*' in bash). However, bpipe will parallelise based on differences in the % part of the name. For example, if these are your files:
sampleA_R1.fastq.gz, sampleA_R2.fastq.gz, sampleB_R1.fastq.gz, sampleB_R2.fastq.gz
The % parts are sampleA and sampleB, so bpipe will start two parallel jobs.
Sometimes you can run into trouble with the default pattern if the extension (fastq.gz) is different, or if the file names contain multiple "_"s or no "_"s. You may need to change the pattern to fit your case. As an example, let's say your files are called:
A_1_R1.fq.gz, A_1_R2.fq.gz, B_1_R1.fq.gz, B_1_R2.fq.gz
Then you should run bpipe like this:
bpipe run -p fastqInputFormat="%_1_*.fq.gz"
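To preview how a pattern would group your files before running bpipe, the matching can be imitated with sed. This is only an approximation and not bpipe's own code: the function name percent_groups is hypothetical, and it assumes the % part matches greedily, which may differ from bpipe when a file name fits the pattern in more than one way.

```shell
# Hypothetical helper: print the distinct "%" parts that a pattern such as
# %_*.fastq.gz extracts from the given file names (one sample per line).
percent_groups() {
  pattern=$1; shift
  # Escape dots, turn "*" into a plain wildcard, then "%" into a capture group.
  regex=$(printf '%s' "$pattern" | sed -e 's/\./\\./g' -e 's/\*/.*/g' -e 's/%/\\(.*\\)/')
  # Keep only names that match the whole pattern and print the captured part.
  printf '%s\n' "$@" | sed -n "s/^${regex}\$/\\1/p" | sort -u
}
```

Calling percent_groups '%_*.fastq.gz' with the four sampleA and sampleB files above lists sampleA and sampleB. An empty result means no file matched, which is the situation behind the "did not match any of the files" error.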
If you encounter an error with JAFFA, please open a new issue in the GitHub repository. Please supply as much of the following information as possible:
- Whether you have tried to run JAFFA on the demo data, and if so, did you also encounter the error using it?
- What command did you use to run JAFFA?
- What error message was observed?
- Please report the result of "ls -l" in the sample directory where JAFFA ran (there should be a list of files such as BT474-demo.fasta, BT474-demo.psl etc., with their file sizes)
- Please report the result of "ls -l" in the directory where JAFFA was installed
- Please attach any log files (log_blat, log_filter and log_genome_blat) found in the sample directory
- Details of your system, such as OS.
- If it's possible, supply a reproducible example (e.g. a subset of the fastq reads which recreate the same error).
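Several of the items above can be gathered in one go. The sketch below writes the system details, both "ls -l" listings and any of the three log files into a single text file to attach to the issue; the function name collect_jaffa_report and the report layout are hypothetical.

```shell
# Hypothetical helper: collect system details, directory listings and
# JAFFA log files into one report file.
collect_jaffa_report() {
  sample_dir=$1; jaffa_dir=$2; out=$3
  {
    echo "== System =="; uname -a
    echo "== ls -l $sample_dir =="; ls -l "$sample_dir"
    echo "== ls -l $jaffa_dir =="; ls -l "$jaffa_dir"
    for log in log_blat log_filter log_genome_blat; do
      if [ -f "$sample_dir/$log" ]; then      # skip logs that were never written
        echo "== $log =="; cat "$sample_dir/$log"
      fi
    done
  } > "$out" 2>&1
}
```

For example: collect_jaffa_report BT474-demo <path to JAFFA> jaffa_report.txt. The command used and the error message still need to be added by hand.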