Skip to content

End to end scoring and annotation with FlashFry

Aaron McKenna edited this page May 21, 2020 · 14 revisions

Overview

In this tutorial, we'll walk through the full process of scoring and annotation for a candidate region of the human genome. Many of the steps here take a significantly longer time than the quickstart tutorial, but this tutorial allows you to walk through the full FlashFry process on real data.

We'll use FlashFry to score the MYC protein coding region, including its up and downstream regions. We'll then annotate this will ChromHMM annotations, and look at the distribution of scores we get back in R, and create a custom BED track to load into the UCSC Genome Browser.

Index your genome

First, you'll want to index a copy of the human genome. We'll be working with hg38.fa. This takes about an hour and a half on my Google cloud machine.

java -Xmx6000m -jar FlashFry-assembly-<version_number_here>.jar \
index \
--tmpLocation ./tmp \
--database hg38_database \
--reference ../genome/hg38.fa \
--enzyme spcas9ngg

Discovering sites in the MYC coding region

Now we need a region to score. Here I've pulled the genomic region of the MYC gene using the UCSC genome browser which I've deposited as a fasta file in the test data folder. This process would be the same procedure for any other gene of interest. The first FlashFry step is to find sites within this region:

java -Xmx4g -jar FlashFry-assembly-<version_number_here>.jar \
discover \
 --database ../hg38_database \
 --fasta myc.fasta \
 --output myc.fasta.sites

Here we find 1206 potential Cas9 targets in the approximately 7K MYC sequence, including 1000 bases upstream and downstream of the coding sequence. We next want to score the sites, including annotation with an associated BED file

Scoring

Here we're interested in scoring our discovered sites with a variety of public metrics, as well as annotating our discovered sites with a BED file. In this case, we'll use the ChromHMM annotations available in the test data folder, but you could really use any BED file you'd like. If you pull the wgEncodeBroadHmmGm12878HMM.bed.gz file you'll have to unzip it first with gunzip. FlashFry currently just uses the BED name column (the 4th column) as the annotation name.

java -Xmx4g -jar FlashFry-assembly-<version_number_here>.jar \
score \
 --input myc.fasta.sites \
 --output myc.fasta.sites.scored \
 --scoringMetrics doench2014ontarget,doench2016cfd,dangerous,hsu2013,minot,bedannotator \
 --database ../hg38_database \
--inputAnnotationBed chromeHMM:wgEncodeBroadHmmGm12878HMM.bed \
--transformPositions mapping.bed

Here we needed to design a mapping file (mapping.bed) that tells FlashFry how to map the chromosomes from our FASTA file, hg38_dna, to its location within the human genome. This tag (hg38_dna) is arbitrary, we could of called it MYC, our_segment, or anything else that we liked. There just has to be a one-to-one mapping of the sequences we scanned to a location in the genome. In our case this is true, and we can map the hg38_dna tag to the correct region of chr8. A simple bed file named mapping.bed can be found in the test_data folder

chr8    127734434       127741477       hg38_dna

This BED file has the genome location in the standard tab-separated BED format, followed by the first space-separated chromosome name from our FASTA file, in this case, hg38_dna. With this file, our targets will all be remapped by FlashFry to their in-genome location.

Load up table in R and find the best candidates

We can now load our table into R and look at some characteristics of the data. The table output is fully compatible with read.delim in R., For instance, we can load up the candidate targets and look at the 'dangerous' annotation, indicating targets that may be hard to clone, express, or are seen more than once in the genome.

score.tbl = read.delim("myc.fasta.sites.scored")
summary(score.tbl)
...
                dangerous_GC dangerous_polyT  dangerous_in_genome
 NONE                  :946   NONE :1059      IN_GENOME=1:1131    
 GC_0.782608695652174  : 99   PolyT:  73      IN_GENOME=6:   1    
 GC_0.8260869565217391 : 40                                       
 GC_0.8695652173913043 : 25                                       
 GC_0.9130434782608695 : 17                                       
 GC_0.17391304347826086:  2                                       
 (Other)               :  3                       

You can look at many other score or factors to select targets and then design guides against sequences of interest in MYC. Generally, this means coming to some balance of on and off-target scores, avoiding these dangerous targets/guides, and looking at how your targets overlap your regions of interest in the gene(s).

Create a BED file as a custom track in UCSC

This can now be loaded into the UCSC genome browser by cutting the file into something that looks like a standard BED file:

sed '1d' myc.fasta.sites.scored | cut -f 1,2,3,4 > for_ucsc.bed

This can be loaded as a custom track into UCSC genome browser. The sed portion of the command drops the first line (the header), and the cut command chooses the columns to select. Here I've taken the sequence as our annotation (column 4) but this could be adjusted to any column of interest.