-
Notifications
You must be signed in to change notification settings - Fork 10
End to end scoring and annotation with FlashFry
In this tutorial, we'll walk through the full process of scoring and annotation for a candidate region of the human genome. Many of the steps here take a significantly longer time than the quickstart tutorial, but this tutorial allows you to walk through the full FlashFry process on real data.
We'll use FlashFry to score the MYC protein coding region, including its up and downstream regions. We'll then annotate this will ChromHMM annotations, and look at the distribution of scores we get back in R, and create a custom BED track to load into the UCSC Genome Browser.
First, you'll want to index a copy of the human genome. We'll be working with hg38.fa
. This takes about an hour and a half on my Google cloud machine.
java -Xmx6000m -jar FlashFry-assembly-<version_number_here>.jar \
index \
--tmpLocation ./tmp \
--database hg38_database \
--reference ../genome/hg38.fa \
--enzyme spcas9ngg
Now we need a region to score. Here I've pulled the genomic region of the MYC gene using the UCSC genome browser which I've deposited as a fasta file in the test data folder. This process would be the same procedure for any other gene of interest. The first FlashFry step is to find sites within this region:
java -Xmx4g -jar FlashFry-assembly-<version_number_here>.jar \
discover \
--database ../hg38_database \
--fasta myc.fasta \
--output myc.fasta.sites
Here we find 1206 potential Cas9 targets in the approximately 7K MYC sequence, including 1000 bases upstream and downstream of the coding sequence. We next want to score the sites, including annotation with an associated BED file
Here we're interested in scoring our discovered sites with a variety of public metrics, as well as annotating our discovered sites with a BED file. In this case, we'll use the ChromHMM annotations available in the test data folder, but you could really use any BED file you'd like. If you pull the wgEncodeBroadHmmGm12878HMM.bed.gz file you'll have to unzip it first with gunzip. FlashFry currently just uses the BED name column (the 4th column) as the annotation name.
java -Xmx4g -jar FlashFry-assembly-<version_number_here>.jar \
score \
--input myc.fasta.sites \
--output myc.fasta.sites.scored \
--scoringMetrics doench2014ontarget,doench2016cfd,dangerous,hsu2013,minot,bedannotator \
--database ../hg38_database \
--inputAnnotationBed chromeHMM:wgEncodeBroadHmmGm12878HMM.bed \
--transformPositions mapping.bed
Here we needed to design a mapping file (mapping.bed
) that tells FlashFry how to map the chromosomes from our FASTA file, hg38_dna, to its location within the human genome. This tag (hg38_dna) is arbitrary, we could of called it MYC, our_segment, or anything else that we liked. There just has to be a one-to-one mapping of the sequences we scanned to a location in the genome. In our case this is true, and we can map the hg38_dna tag to the correct region of chr8. A simple bed file named mapping.bed can be found in the test_data folder
chr8 127734434 127741477 hg38_dna
This BED file has the genome location in the standard tab-separated BED format, followed by the first space-separated chromosome name from our FASTA file, in this case, hg38_dna. With this file, our targets will all be remapped by FlashFry to their in-genome location.
We can now load our table into R and look at some characteristics of the data. The table output is fully compatible with read.delim
in R., For instance, we can load up the candidate targets and look at the 'dangerous' annotation, indicating targets that may be hard to clone, express, or are seen more than once in the genome.
score.tbl = read.delim("myc.fasta.sites.scored")
summary(score.tbl)
...
dangerous_GC dangerous_polyT dangerous_in_genome
NONE :946 NONE :1059 IN_GENOME=1:1131
GC_0.782608695652174 : 99 PolyT: 73 IN_GENOME=6: 1
GC_0.8260869565217391 : 40
GC_0.8695652173913043 : 25
GC_0.9130434782608695 : 17
GC_0.17391304347826086: 2
(Other) : 3
You can look at many other score or factors to select targets and then design guides against sequences of interest in MYC. Generally, this means coming to some balance of on and off-target scores, avoiding these dangerous targets/guides, and looking at how your targets overlap your regions of interest in the gene(s).
This can now be loaded into the UCSC genome browser by cutting the file into something that looks like a standard BED file:
sed '1d' myc.fasta.sites.scored | cut -f 1,2,3,4 > for_ucsc.bed
This can be loaded as a custom track into UCSC genome browser. The sed portion of the command drops the first line (the header), and the cut command chooses the columns to select. Here I've taken the sequence as our annotation (column 4) but this could be adjusted to any column of interest.