---
title: "HG002 PacBio CCS Small Variants"
author: "Nate Olson"
date: '`r Sys.Date()`'
output: 
    bookdown::html_document2:
        toc: true
        toc_float: true
        df_print: paged
        code_folding: hide
---

```{r load_packages, message = FALSE, echo = FALSE}
library(VariantAnnotation)
library(BSgenome.Hsapiens.1000genomes.hs37d5)
library(happyR)
library(tidyverse)
library(googlesheets)
```

# Background

* Analysis of small variant callsets generated from PacBio CCS data. 
* The long-read, low-error-rate data can potentially correct mistakes in, and expand, the HG002 benchmark set.  
* datasets - PacBio CCS 15kb libraries for HG002  
* variant callsets
  * Five SNV+indel callsets: 
    * GATK4, 
    * DeepVariant, 
    * DeepVariant trained with haplotype information, 
    * GATK4 re-genotyped with WhatsHap,
    * DeepVariant re-genotyped with WhatsHap. 
* Variant calls were made against the GRCh37 reference.  
* benchmarking - Callsets were benchmarked against the HG002 v3.3.2 benchmark callset using the precisionFDA app (vcfeval + hap.py) with GA4GH custom stratifications.  

# Analysis Overview
1. Variant caller performance against NIST HG002 v3.3.2
1. Benchmark expansion estimate.  
1. Identification of potential mistakes in current benchmark
1. Errors in the CCS calls


# Approach/Methods
## Benchmarking 
- Stratified comparison to NIST HG002 small variant benchmark v3.3.2.   
- Summarize overall performance
- Identify where callsets perform well and poorly
- Haplotype and re-genotype utility


## Manual Curation
A number of false positive and negative sites were randomly sampled to further evaluate the potential for using CCS callsets to correct errors in benchmark sets.
A total of 60 variants were randomly sampled for manual curation.  

- 5 from each set 
  - FP and FN  
  - SNP and Indels  
  - Target categories and other categories  

Along with 5 SNP and 5 Indel FP allele match errors.  

Target categories - lowcmp_AllRepeats_gt95identity_slop5 and lowcmp_SimpleRepeat_imperfecthomopolymer_gt10_slop5  
The target categories were defined to include most homopolymers because, based on our preliminary analysis, these were the largest source of errors in the variant callsets. 
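
The sampling design can be tallied with a short sketch. It assumes that sampling follows the grouping used in the sampling code later in this document (five variants per combination of variant category and target category, for a single callset). The category labels are the `var_cat` and `target_cat` values assigned by `annotate_var_df()`, and `sampling_design` is an illustrative name.

```r
## Sketch: expected number of variants sampled for manual curation,
## assuming 5 variants per var_cat x target_cat group for one callset.
var_cats <- c("FP.SNP", "FP.INDEL", "FN.SNP", "FN.INDEL", "AM.SNP", "AM.INDEL")
target_cats <- c("target", "non-target")

## One row per sampling group
sampling_design <- expand.grid(var_cat = var_cats,
                               target_cat = target_cats,
                               stringsAsFactors = FALSE)
sampling_design$n_sampled <- 5

## Total number of variants selected for curation
n_total <- sum(sampling_design$n_sampled)
```

Under these assumptions the 12 groups give 60 variants in total, which agrees with the total stated above.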

## Benchmark Region Expansion Estimate
_Estimating how many extra variants and regions CCS might help add to the GIAB benchmark._

<!-- TODO - describe method used to calculate region and variant number -->
Callable regions were identified using CallableLoci - see `scripts/ccs_callableLoci.sh`.  
Stratifications for homopolymers, HG002 SVs, and segmental duplications (superdups) were subtracted from the callable regions, and the result was compared to the benchmark regions - see `scripts/calc_extend_bench.sh`. 
The (non-N) genome coverage was then calculated for the resulting regions. 
This estimate represents an upper limit for how much we expect the CCS data to expand the current benchmark regions.
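
The interval arithmetic behind this estimate can be sketched in R with `GenomicRanges` on toy intervals. This is an illustration only: the analysis itself is run by the shell scripts referenced above, and the intervals, object names, and non-N genome size below are made up.

```r
library(GenomicRanges)

## Toy intervals on a single chromosome (hypothetical values)
callable  <- GRanges("1", IRanges(start = c(1, 201), end = c(100, 300)))
excluded  <- GRanges("1", IRanges(start = 51, end = 60))   # e.g. homopolymers, SVs, segdups
benchmark <- GRanges("1", IRanges(start = 1, end = 40))    # current benchmark regions

## Callable regions after subtracting the excluded stratifications
candidate <- setdiff(callable, excluded)

## Candidate regions not already covered by the benchmark
new_regions <- setdiff(candidate, benchmark)

## Bases added and fraction of a (hypothetical) non-N genome size
new_bases <- sum(width(new_regions))
non_n_bases <- 1000
frac_added <- new_bases / non_n_bases
```

Because nothing in this subtraction checks whether variants in the new regions can be called accurately, the resulting fraction is best read as an upper bound.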

## Errors in NIST HG002 v3.3.2
Example LINE (IGV screenshot) and an estimate of total count. 

## Loading and Tidying Data

### Benchmarking  

```{r message = FALSE}
## Loading data
hap_list <- list(
    GATK4 = "data/happy_output/results/result_1",
    DeepVar = "data/happy_output/results/result_2",
    DeepVarHap = "data/happy_output_deepVarHap/results/result_1",
    GATK4_retyped = "data/happy_output_jebler_retyped/results/result_1",
    DeepVar_retyped = "data/happy_output_jebler_retyped/results/result_2") %>%
    map(read_happy)

## Creating a tidy data frame
extend_df <- hap_list %>% 
    map("extended") %>% 
    bind_rows(.id = "query_method")

ext_trim_df <- extend_df %>% 
    ## Excluding non-HG002 genome specific stratifications
    filter(!str_detect(Subset, "HG00[1345]")) %>% 
    ## excluding strat with < 1000 obs - lack of statistical power
    filter(Subset.Size > 1000) %>% 
    ## Subset must have at least one variant in region
    filter(Subset.IS_CONF.Size > 0)
```

# Results for Manuscript

## Small Variant Detection

### Table 1 - High level metrics by callset

```{r}
benchmark_metrics_tbl <- ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Subset %in% "*") %>% 
    filter(Filter == "PASS") %>% 
    rename(Callset = query_method,
           Recall = METRIC.Recall, 
           Precision = METRIC.Precision, 
           F1 = METRIC.F1_Score) %>%
    dplyr::select(Callset, Type, Recall, Precision, F1) %>% 
    arrange(Type, -F1) 
```

```{r}
write_csv(benchmark_metrics_tbl, "results/benchmark_metrics_tbl.csv")
```


Metrics for high level comparison
```{r}
benchmark_metrics_tbl
```
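
For reference, the three metrics in the table are derived from true positive (TP), false negative (FN), and false positive (FP) counts. The sketch below uses made-up counts; `truth_tp`, `truth_fn`, `query_tp`, and `query_fp` are illustrative names standing in for the corresponding benchmarking counts.

```r
## Hypothetical counts for a single callset and variant type
truth_tp <- 990   # benchmark variants found in the query callset
truth_fn <- 10    # benchmark variants missed by the query callset
query_tp <- 990   # query variants matching the benchmark
query_fp <- 5     # query variants not in the benchmark

## Recall: fraction of benchmark variants recovered
recall <- truth_tp / (truth_tp + truth_fn)

## Precision: fraction of query variants that are correct
precision <- query_tp / (query_tp + query_fp)

## F1: harmonic mean of precision and recall
f1 <- 2 * precision * recall / (precision + recall)
```

Because F1 is a harmonic mean, it always lies between recall and precision and is pulled towards the lower of the two.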

```{r benchSummary, fig.cap = "Benchmarking performance metrics for CCS variant callsets. All - all HG002 benchmark regions, Diff - all difficult regions, and Not Diff - not in difficult regions."}
## Do we have all difficult regions not in HG002allgenomespecific
high_level_strat <- c("*",
                       "alldifficultregions",
                       "notinalldifficultregions")

bench_overview_df <- ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(Subset %in% high_level_strat) %>% 
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
  mutate(Subset = fct_recode(Subset, "All" = "*", 
                             "Diff" = "alldifficultregions", 
                             "Not Diff" = "notinalldifficultregions"))


bench_overview_df %>% 
    ggplot(aes(x = Subset, y = Value, 
               fill = query_method, group=query_method)) + 
  geom_linerange(aes(ymin = 0.00001, ymax = Value),
                     position=position_dodge(width=0.5)) + 
  geom_point(shape = 21, color = "grey40", position=position_dodge(width=0.5)) + 
  scale_y_log10() +
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom") + 
  annotation_logticks(sides = "l") +
  labs(fill = "Callset")
```
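
The figure plots `1 - metric` on a log scale so that differences between values close to 1 remain visible. A minimal sketch of that reshaping on a made-up table follows; it uses `tidyr::pivot_longer()` in place of `gather()`, and the callset names and values are hypothetical.

```r
library(dplyr)
library(tidyr)
library(stringr)

## Hypothetical metrics for two callsets
toy_df <- tibble(
    query_method = c("A", "B"),
    METRIC.Recall = c(0.999, 0.99),
    METRIC.Precision = c(0.9999, 0.995)
)

## Long format with one row per callset and metric, converted to 1 - metric
toy_long <- toy_df %>%
    pivot_longer(starts_with("METRIC"),
                 names_to = "Metric", values_to = "Value") %>%
    mutate(Metric = paste("1 -", str_remove(Metric, fixed("METRIC."))),
           Value = 1 - Value)
```

On a linear axis the recall values 0.999 and 0.99 are nearly indistinguishable, whereas their complements 0.001 and 0.01 sit a full decade apart on a log axis.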

```{r}
ggsave("results/benchmark_highlevel_strats.png", 
       dpi = 600,width = 6, height = 3)
```

### Indel Stratification
__TODO__
* Stratification indel numbers for homopolymers > 2bp - JZ will provide code and input files for analysis
    - for use in estimating the percentage of discordant indels in homopolymer runs 


### Supplemental Figure R4-1 - Key Stratification Results
* Add text with benchmark stratification results (1 - 2 sentences)
* Supplemental Figure R4-1: Key stratification results including overview and homopolymers

Overall high precision and recall for SNPs, and indels not in difficult regions. 

```{r fig.cap = "Performance metrics stratification results."}
stratifications <- c("*",
                       "alldifficultregions",
                       "lowcmp_AllRepeats_gt95identity_slop5",
                       "lowcmp_SimpleRepeat_imperfecthomopolymer_gt10_slop5",
                       # "lowcmp_SimpleRepeat_homopolymer_6to10",
                       # "lowcmp_SimpleRepeat_homopolymer_gt10",
                       # "map_l250_m0_e0",
                       "notinalldifficultregions",
                       "notinlowcmp_AllRepeats_gt95identity_slop5")

bench_strat_df <- ext_trim_df %>% 
    filter(Subtype == "*", query_method == "DeepVar") %>% 
    filter(Filter == "PASS") %>% 
    filter(Subset %in% stratifications) %>% 
    dplyr::select(Type, Subset, query_method,
                  contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", 
                            paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 
                           1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
  ## Relevel factors
  mutate(Subset = fct_recode(Subset, 
                             "All" = "*", 
                             "Diff" = "alldifficultregions", 
                             "Not Diff" = "notinalldifficultregions",
                             "All Reps." = "lowcmp_AllRepeats_gt95identity_slop5",
                             "Not All Reps." = "notinlowcmp_AllRepeats_gt95identity_slop5",
                       "Imp. Homopolymer" = "lowcmp_SimpleRepeat_imperfecthomopolymer_gt10_slop5" #,
                       # "Homopolymer 6 - 10 bp" = "lowcmp_SimpleRepeat_homopolymer_6to10",
                       # "Homopolymer >10 bp" = "lowcmp_SimpleRepeat_homopolymer_gt10",
                       # "Map 250" = "map_l250_m0_e0"
                       )
         )


bench_strat_df %>% 
    ggplot(aes(x = Subset, y = Value, 
               fill = Subset, 
               group=query_method)) + 
  geom_linerange(aes(ymin = 0.000025, ymax = Value),
                     position=position_dodge(width=0.5)) + 
  geom_point(shape = 21, color = "grey20", position=position_dodge(width=0.5)) + 
  scale_y_log10() +
  scale_fill_brewer(type = "qual", palette = 2) +
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom", 
          # axis.text.x = element_text(angle = -45, hjust = 0)
          axis.text.x = element_blank(),
          axis.title.x = element_blank()) + 
  annotation_logticks(sides = "l") +
  labs(fill = "Stratifications")
```

```{r}
ggsave("results/benchmark_strats.png", 
       dpi = 600,width = 8, height = 4)
```


Based on the stratified analysis of the benchmarking results, variant caller accuracy was poor for low complexity repeats with greater than 95% similarity (including homopolymers and tandem repeats). For indels, variant caller performance when excluding low complexity repeats with > 95% similarity was similar to performance when excluding all difficult regions, indicating that this stratification is responsible for most of the discrepancies between the HG002 v3.3.2 benchmark callset and the CCS callsets. 


```{r fig.cap = "Performance in homopolymers."}
ext_trim_df %>% 
    filter(Subtype == "*", Filter == "PASS", query_method == "DeepVar") %>% 
    filter(str_detect(Subset,"homopolymer")) %>% 
    filter(!str_detect(Subset,"_unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_")) %>%
    mutate(Subset = str_remove(Subset, "_slop5")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>% 
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
        dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = query_method, y = Value, fill = Subset), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom",axis.text.x = element_text(angle = -45, hjust = 0)) + 
    labs(x = "Callset", fill = "Homopolymer Type")
```

```{r fig.cap = "Benchmarking metrics for TRDB." }
ext_trim_df %>% 
    filter(Subtype == "*", query_method == "DeepVar") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"Human_Full_Genome"),
           !str_detect(Subset,"unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>%
    mutate(Subset = str_remove(Subset, "lowcmp_Human_Full_Genome_TRDB_hg19_150331_")) %>%
    mutate(Subset = str_remove(Subset, "_gt95identity_merged")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>%
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "lt", " <")) %>%
    mutate(Subset = str_replace(Subset, "gt", " >")) %>%
        dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90)) + 
    labs(x = "Tandem Repeat Type", fill = "Callset")
```


## Improving Small Variant Detection with Haplotype Phasing

Impact of re-genotyping on benchmarking results (Table 1, Supplemental Fig R6-1)

```{r whatHapEff, fig.cap = "Impact of re-genotyping using WhatsHap on GATK4 and DeepVariant callset benchmark metrics. Diff - all difficult regions, and Not Diff - not in difficult regions."}

## Do we have all difficult regions not in HG002allgenomespecific
high_level_strat <- c("*",
                       "alldifficultregions",
                       "notinalldifficultregions")

bench_overview_df <- ext_trim_df %>% 
    filter(Subtype == "*", query_method %in% c("GATK4","DeepVar", "GATK4_retyped","DeepVar_retyped")) %>% 
    filter(Filter == "PASS") %>% 
    filter(Subset %in% high_level_strat) %>% 
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
  mutate(Subset = fct_recode(Subset, "All" = "*", 
                             "Diff" = "alldifficultregions", 
                             "Not Diff" = "notinalldifficultregions"))


bench_overview_df %>% 
    ggplot() + geom_col(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge") + 
  scale_y_log10() +
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom") + 
  labs(fill = "Callset")
```

## Revising and expanding GIAB

# Initial Analysis


```{r, fig.cap = "Precision v. recall for a subset of the stratifications.", eval = FALSE}
strat_of_interest <- c("*","HG002allgenomespecific",
                       "HG002allgenomespecificandifficult",
                       "alldifficultregions","decoy", 
                     "lowcmp_AllRepeats_51to200bp_gt95identity_merged_slop5",
                       "lowcmp_AllRepeats_gt200bp_gt95identity_merged_slop5",
                       "lowcmp_AllRepeats_gt95identity_slop5",
                    "lowcmp_AllRepeats_lt51bpTRs_gt95identity_merged_slop5",
                       "lowcmp_AllRepeats_lt51bp_gt95identity_merged_slop5",
                       "lowcmp_SimpleRepeat_homopolymer_6to10",
                       "lowcmp_SimpleRepeat_homopolymer_gt10",
                       "lowcmp_SimpleRepeat_imperfecthomopolymer_gt10_slop5",
                       "map_all","map_l250_m0_e0",
                       "notinHG002allgenomespecificandifficult",
                       "notinalldifficultregions",
                       "notinlowcmp_AllRepeats_gt95identity_slop5","segdup")

gg <- ext_trim_df %>% 
    filter(Type == "INDEL", Subtype == "*", Filter == "PASS") %>% 
    filter(Subset %in% strat_of_interest) %>% 
    ggplot() + 
    geom_point(aes(x = METRIC.Recall, 
                   y = METRIC.Precision,
                   alpha = log10(Subset.Size),
                   text = Subset),
               shape = 19, stroke = 0.25) + 
    facet_wrap(~query_method) + 
    theme_bw() + 
    labs(x = "Recall", y = "Precision")
plotly::ggplotly(gg)
```

```{r eval = FALSE}
gg <- ext_trim_df %>% 
    filter(Type == "SNP", Subtype == "*", Filter == "PASS") %>% 
    filter(Subset %in% strat_of_interest) %>% 
    ggplot() + 
    geom_point(aes(x = METRIC.Recall, 
                   y = METRIC.Precision,
                   alpha = log10(Subset.Size),
                   text = Subset),
               shape = 19, stroke = 0.25) + 
    facet_wrap(~query_method) + 
    theme_bw() + 
    labs(x = "Recall", y = "Precision")
plotly::ggplotly(gg)
```


```{r eval = FALSE}
ext_trim_df %>% 
    filter(Subtype == "*", Filter == "PASS") %>% 
    filter(Subset %in% strat_of_interest) %>% 
    dplyr::select(query_method, Type, Subset, contains("METRIC")) %>% 
    DT::datatable(caption = "Performance metrics of the query methods (variant callsets) for high-level stratifications and stratifications with known poor performance, e.g. homopolymers.")
```


```{r eval = FALSE}
ext_trim_df %>% 
    filter(Type == "INDEL", Filter == "PASS") %>% 
    filter(Subset %in% strat_of_interest) %>% 
    filter(TRUTH.TOTAL > 0)%>% 
    dplyr::select(query_method, Type, Subtype,Subset, contains("METRIC"), TRUTH.TOTAL) %>%
     DT::datatable()
```

<!-- 
__TODO__ Additional figure showing stratifications where the methods perform well and poorly 
- stratifications with low recall and precision
-->

Performance metrics are consistent across homopolymer units for homopolymers between 6 and 10 bp but vary for homopolymers greater than 10 bp. 

<!-- 
TODO - Only show for DeepVar Hap, add metrics for `*` and difficultregions 
-->
```{r fig.cap = "Performance in homopolymers."}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"homopolymer")) %>% 
    filter(!str_detect(Subset,"_unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_")) %>%
    mutate(Subset = str_remove(Subset, "_slop5")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>% 
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
        dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = query_method, y = Value, fill = Subset), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom",axis.text.x = element_text(angle = -45, hjust = 0)) + 
    labs(x = "Callset", fill = "Homopolymer Type")
```

Performance varies by homopolymer unit.
As expected, performance is similar for A and T homopolymers.
Performance differs between G and C homopolymers, which is inconsistent with expectation.
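
The per-unit figures depend on splitting each stratification name into a homopolymer size and a unit. The sketch below walks through that parsing on two example names; the names are hypothetical and only assumed to follow the pattern the cleanup code expects.

```r
library(dplyr)
library(tidyr)
library(stringr)

## Hypothetical stratification names
toy_strats <- tibble(
    Subset = c("lowcmp_SimpleRepeat_homopolymer_6to10_unit=A",
               "lowcmp_SimpleRepeat_homopolymer_gt10_unit=C")
)

parsed_strats <- toy_strats %>%
    ## Drop the common prefix, then make the size readable
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_"),
           Subset = str_replace(Subset, "_", " "),     # first underscore only
           Subset = str_replace(Subset, "to", " to "),
           Subset = str_replace(Subset, "gt", ">")) %>%
    ## The remaining underscore separates size from unit
    separate(Subset, c("Size", "Unit"), sep = "_") %>%
    mutate(Unit = str_remove(Unit, "unit="),
           Size = str_remove(Size, "homopolymer "))
```

The parsing relies on `str_replace()` changing only the first match, so exactly one underscore is left for `separate()` to split on.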

```{r fig.cap = "DeepVariant performance metrics for homopolymers by unit.", eval = FALSE}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"homopolymer")) %>% 
    filter(str_detect(Subset,"_unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_")) %>%
    mutate(Subset = str_remove(Subset, "_slop5")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>% 
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    separate(Subset, c("Size", "Unit"), sep = "_") %>% 
    mutate(Unit = str_remove(Unit, "unit=")) %>% 
    mutate(Size = str_remove(Size, "homopolymer ")) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    filter(query_method == "DeepVar") %>% 
    ggplot() + geom_bar(aes(x = Size, y = Value, fill = Unit), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom") + 
    labs(x = "Simple Repeat Type", fill = "Unit")
```

```{r fig.cap = "DeepVariant with haplotype information performance metrics for homopolymers by unit."}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"homopolymer")) %>% 
    filter(str_detect(Subset,"_unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_")) %>%
    mutate(Subset = str_remove(Subset, "_slop5")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>% 
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    separate(Subset, c("Size", "Unit"), sep = "_") %>% 
    mutate(Unit = str_remove(Unit, "unit=")) %>% 
    mutate(Size = str_remove(Size, "homopolymer ")) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    filter(query_method == "DeepVarHap") %>% 
    ggplot() + geom_bar(aes(x = Size, y = Value, fill = Unit), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom") + 
    labs(x = "Simple Repeat Type", fill = "Unit")
```

```{r fig.cap = "GATK4 performance metrics for homopolymers by unit.", eval = FALSE}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"homopolymer")) %>% 
    filter(str_detect(Subset,"_unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_")) %>%
    mutate(Subset = str_remove(Subset, "_slop5")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>% 
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    separate(Subset, c("Size", "Unit"), sep = "_") %>% 
    mutate(Unit = str_remove(Unit, "unit=")) %>% 
    mutate(Size = str_remove(Size, "homopolymer ")) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    filter(query_method == "GATK4") %>% 
    mutate(Unit = fct_relevel(Unit, c("A","T","C","G"))) %>% 
    ggplot() + geom_bar(aes(x = Size, y = Value, fill = Unit), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(legend.position = "bottom") + 
    labs(x = "Simple Repeat Type", fill = "Unit")
```


<!-- DeepVariant worse recall for longer insertions.  -->
```{r eval = FALSE}
ext_trim_df %>% 
    filter(Type == "INDEL") %>% 
    filter(Filter == "PASS") %>% 
    filter(!str_detect(Subtype, "C")) %>% 
    filter(str_detect(Subset,"homopolymer"),
           !str_detect(Subset,"unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_")) %>%
    mutate(Subset = str_remove(Subset, "_slop5")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>% 
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
    dplyr::select(Subtype, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Subtype, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA", Metric != "1 - F1_Score") %>% 
    ggplot() + geom_bar(aes(x = Subtype, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Metric~Subset, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90),
          legend.position = "bottom") + 
    labs(x = "Indel Subtype", fill = "Callset")
```


## Manual Curation 

```{r message = FALSE}
get_var_df <- function(vcffile){
    ## Read VCF
    vcf <- readVcf(vcffile, genome = "BSgenome.Hsapiens.1000genomes.hs37d5")

    ## Generate tidy data frame
    ## Truth table classification
    tt_class_df <- geno(vcf)[['BD']] %>%
        as.data.frame() %>%
        rownames_to_column(var = "variant") %>%
        filter(QUERY == "FP" | TRUTH == "FN") %>%
        as_tibble()

    ## Subsetting vcf for FP and FN
    fpfn_positions <- str_remove(tt_class_df$variant, "_.*")
    vcf_positions <- rownames(vcf) %>% str_remove("_.*")
    fpfn_vcf <- vcf[vcf_positions %in% fpfn_positions ]
    
    ## Truth table classifications for FP and FN positions
    tt_fpfn_df <- geno(fpfn_vcf)[['BD']] %>%
        as.data.frame() %>%
        rownames_to_column(var = "variant") %>%
        as_tibble()
     
    ## Benchmarking Type 
    bk_type_df <- geno(fpfn_vcf)[['BK']] %>% 
        as.data.frame() %>%
        rownames_to_column(var = "variant") %>% 
        dplyr::rename(Q.bk = QUERY, T.bk = TRUTH) %>%
        as_tibble() %>% 
        left_join(tt_class_df, by = c("variant" = "variant"))
    
    ## Genotype 
    gt_type_df <- geno(fpfn_vcf)[['GT']] %>% 
        as.data.frame() %>%
        rownames_to_column(var = "variant") %>%
        dplyr::rename(Q.gt = QUERY, T.gt = TRUTH) %>% 
        as_tibble() %>% 
        left_join(bk_type_df)
    
    ## Variant type
    fpfn_df <- geno(fpfn_vcf)[['BVT']] %>%
        as.data.frame() %>%
        rownames_to_column(var = "variant") %>%
        dplyr::rename(Q.Var = QUERY, T.Var = TRUTH) %>%
        left_join(gt_type_df)

    ## Combining into single data frame
    fpfn_df %>% add_column(FILTER = rowRanges(fpfn_vcf)$FILTER,
                           Regions = as.list(info(fpfn_vcf)[['Regions']]))
}

annotate_var_df <-  function(var_df){
    var_anno_df <- var_df %>% 
        mutate(allrepeats = map_lgl(Regions, ~("lowcmp_AllRepeats_gt95identity_slop5" %in% .)),
               imphomo = map_lgl(Regions, ~("lowcmp_SimpleRepeat_imperfecthomopolymer_gt10_slop5" %in% .)),
               target_cat = if_else(allrepeats + imphomo == 0, "non-target","target")) %>%
        dplyr::select(-allrepeats, -imphomo)

    mutate(var_anno_df, 
           var_cat = case_when( 
               (Q.bk == "am" | T.bk == "am") & (Q.Var == "INDEL" | T.Var == "INDEL") ~ "AM.INDEL",
               (Q.bk == "am" | T.bk == "am") & (Q.Var == "SNP" | T.Var == "SNP") ~ "AM.SNP",
               # Q.bk == "am" | T.bk == "am" ~ "AM",
               QUERY == "FP" & Q.Var == "INDEL" ~ "FP.INDEL",
               QUERY == "FP" & Q.Var == "SNP" ~ "FP.SNP",
               QUERY == "." & T.Var == "INDEL" ~ "FN.INDEL",
               QUERY == "." & T.Var == "SNP" ~ "FN.SNP",
               TRUE ~ "other")
           )
}
```
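
The target/non-target assignment in `annotate_var_df()` comes down to checking whether a variant's list of overlapping regions contains either target stratification. A minimal sketch on a made-up table follows; the variants and the `map_all` region entry are illustrative only.

```r
library(dplyr)
library(purrr)

## Target stratifications used for manual curation sampling
target_strats <- c("lowcmp_AllRepeats_gt95identity_slop5",
                   "lowcmp_SimpleRepeat_imperfecthomopolymer_gt10_slop5")

## Hypothetical variants with a list column of overlapping regions
toy_var_df <- tibble(
    variant = c("1:100_A/T", "2:200_G/GA"),
    Regions = list(c("CONF", "lowcmp_AllRepeats_gt95identity_slop5"),
                   c("CONF", "map_all"))
)

## A variant is a target if any target stratification is among its regions
toy_anno_df <- toy_var_df %>%
    mutate(target_cat = if_else(
        map_lgl(Regions, ~ any(target_strats %in% .x)),
        "target", "non-target"))
```

Here the first variant is labelled `target` and the second `non-target`.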


```{r}
## Getting random subset
anno_df <- list(gatk = "data/happy_output/results/result_1.vcf.gz",
                deep = "data/happy_output/results/result_2.vcf.gz",
                deepHap = "data/happy_output_deepVarHap/results/result_1.vcf.gz") %>% 
    map(get_var_df) %>% 
    map_dfr(annotate_var_df, .id = "callset") %>% 
    # Cleaner variant ids
    mutate(CHROM = str_extract(variant, ".*(?=:)"), 
           POS = str_extract(variant, "(?<=:).*(?=_)"), 
           ALT = str_extract(variant, "(?<=_).*"))
    
set.seed(531)
var_random_df <- anno_df %>% 
    dplyr::filter(var_cat != "other") %>% 
    mutate(CHROM = str_extract(variant, ".*(?=:)")) %>% 
    mutate(POS = str_extract(variant, "(?<=:).*(?=_)")) %>% 
    ## Random subsetting to include multiple variant groups
    group_by(callset, var_cat, target_cat) %>%
    sample_n(size = 5) %>%
    ungroup() %>% 
    ## Includes ALTs at the same position
    dplyr::select(callset, CHROM, POS) %>%
    left_join(anno_df)

deepvar_random_df <- var_random_df %>% 
    filter(callset == "deepHap") %>%
    mutate(variant = str_remove(variant, ".*_")) %>% 
    dplyr::rename(VAR = variant) %>% 
    dplyr::select(-Regions, -ALT) %>% 
    dplyr::select(callset, target_cat, var_cat, CHROM, POS, everything())
```
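
The variant identifiers are split into chromosome, position, and allele fields using regular expressions with lookarounds. The sketch below applies the same patterns to two made-up identifiers, assuming the `CHROM:POS_REF/ALT` layout of the row names.

```r
library(stringr)

## Hypothetical variant identifiers
toy_variants <- c("1:100_A/T", "X:2500_G/GA")

## Everything before the colon
chrom <- str_extract(toy_variants, ".*(?=:)")

## Between the colon and the underscore (returned as character, not numeric)
pos <- str_extract(toy_variants, "(?<=:).*(?=_)")

## Everything after the underscore (the REF/ALT string)
alt <- str_extract(toy_variants, "(?<=_).*")
```

Positions come back as character strings, so they need `as.integer()` before any numeric comparison or sorting.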

```{r}
## Total number of variants in each variant category for target and non-target regions.
anno_df %>%
    dplyr::select(callset, CHROM, POS, target_cat, var_cat) %>% 
    distinct() %>% 
    filter(callset == "deepHap") %>% 
    group_by(target_cat, var_cat) %>% 
    summarise(count = n()) %>% 
    spread(target_cat, count) 
```

```{r fig.height = 12, fig.width = 12, fig.cap = "Heatmap with stratifications for the random variant subset selected for manual curation.", eval = FALSE}
var_region_heatmap_df <- var_random_df %>% 
    filter(callset == "deepHap") %>%
    dplyr::select(variant, var_cat, target_cat, Regions) %>% 
    mutate(region_df = map(Regions, ~ tibble(value = .x))) %>% 
    dplyr::select(-Regions) %>% 
    unnest(cols = region_df)
var_region_heatmap_df %>% 
    filter(!str_detect(value, "HG00[1345]"), value != "CONF") %>%
    mutate(CHROM_POS = str_extract(variant, ".*(?=_)")) %>%
    mutate(ALT = str_extract(variant, "(?<=_).*")) %>%
    ggplot() + 
    geom_raster(aes(y = value, x = CHROM_POS, fill = target_cat)) + 
    theme(axis.text.x = element_text(angle = -90)) + 
    facet_wrap(~var_cat, nrow = 1, scales = "free_x")
```

```{r}
### Manual curation spreadsheets
## Files already generated
# tmp_tsv <- tempfile(fileext = ".tsv")
# deepvar_random_df %>%
#     add_column(PacBio = "", GIAB = "", Notes = "") %>%
#     write_tsv(path = tmp_tsv)
# 
# gs_upload(file = tmp_tsv, sheet_title = "hg002-ccs-deepvar-curate-JZ")
# gs_upload(file = tmp_tsv, sheet_title = "hg002-ccs-deepvar-curate-JM")
# gs_upload(file = tmp_tsv, sheet_title = "hg002-ccs-deepvar-curate-NDO")
```


## Benchmark Region Expansion
<!-- __TODO__ clean-up code, incorporate reproducibility -->
```{r}
get_genome <- function(genome){
    if (genome == "hs37d5"){
        require(BSgenome.Hsapiens.1000genomes.hs37d5)
        return(BSgenome.Hsapiens.1000genomes.hs37d5)
    } else if (genome == "GRCh38") {
        require(BSgenome.Hsapiens.NCBI.GRCh38)
        return(BSgenome.Hsapiens.NCBI.GRCh38)
    } else {
        stop("Genome not `hs37d5` or `GRCh38`")
    }
}


get_chrom_sizes <- function(genome = "hs37d5"){

    genome_obj <- get_genome(genome)
    
    ## Standard bases plus N; other IUPAC codes are excluded from the counts
    bases <- c("A","C","G","T","N")
    
    get_alpha_freq <- function(i){
        alphabetFrequency(genome_obj[[i]])[bases] %>% 
            data.frame()
    }
    
    alpha_freq_df <- as.list(1:22) %>% 
        map_dfc(get_alpha_freq)
    
    colnames(alpha_freq_df) <- paste0("chr",1:22)
    
    alpha_freq_df <- alpha_freq_df %>% 
        add_column(base = bases)
    
    chromosome_lengths <- alpha_freq_df %>% 
        tidyr::gather(key = "chrom", value = "nbases", -base) %>% 
        group_by(chrom) %>% 
        mutate(base_type = if_else(base == "N", "N", "non_N")) %>% 
        group_by(chrom, base_type) %>% 
        summarise(n_bases = sum(nbases)) %>% 
        tidyr::spread(base_type, n_bases) %>% 
        mutate(len = N + non_N) %>% 
        dplyr::select(-N)
    
    ## data frame with total length and number of non-N bases
    tibble(chrom = "genome", 
           non_N = sum(chromosome_lengths$non_N),
           len = sum(chromosome_lengths$len)) %>% 
        bind_rows(chromosome_lengths)
}


### High confidence coverage ###################################################
### Per-chromosome coverage lengths from a BED file
get_bed_cov_by_chrom <- function(bed_file){
    ## Read as tsv
    bed_df <- read_tsv(bed_file, 
                       col_names = c("chrom","start","end","info"), col_types = "ciic") %>% 
        ## Compute region size (BED intervals are 0-based, half-open)
        mutate(region_size = end - start) %>% 
        ## Only looking at Chromosomes 1-22
        filter(chrom %in% c(1:22, paste0("chr", 1:22)))
    
    ## Compute bases per chromosome
    chrom_cov <- bed_df %>% 
        ## Normalize chromosome names to the chr prefix, avoiding "chrchr1"
        ## when the BED file already uses chr-prefixed names
        mutate(chrom = paste0("chr", str_remove(chrom, "^chr"))) %>% 
        group_by(chrom,info) %>% 
        summarise(nbases = sum(region_size))

    ## NBases data frame
    tibble(chrom = "genome",
           nbases = NA_integer_, 
           info = "") %>% 
        bind_rows(chrom_cov)
}

chrom_size_df <- get_chrom_sizes()
bench_extend_df <- get_bed_cov_by_chrom("data/benchmark_extend/ccs_extended.bed")
```
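
As a minimal sketch of what `get_bed_cov_by_chrom()` computes, the example below runs the same steps on a small, made-up BED table. The `toy_bed` name and all coordinates are illustrative assumptions, not project data. BED intervals are 0-based and half-open, so region size is `end - start`. Base R is used here only to keep the sketch free of package dependencies.

```r
## Hypothetical BED records; values are invented for illustration only
toy_bed <- data.frame(
    chrom = c("1", "1", "chr2", "X"),
    start = c(0L, 100L, 50L, 10L),
    end   = c(10L, 150L, 60L, 20L),
    info  = c("CALLABLE", "CALLABLE", "CALLABLE", "CALLABLE"),
    stringsAsFactors = FALSE
)

## Region size for 0-based, half-open intervals
toy_bed$region_size <- toy_bed$end - toy_bed$start

## Normalize names to the chr prefix, accepting both "1" and "chr1" styles
toy_bed$chrom <- paste0("chr", sub("^chr", "", toy_bed$chrom))

## Keep autosomes only
toy_bed <- toy_bed[toy_bed$chrom %in% paste0("chr", 1:22), ]

## Sum covered bases per chromosome and annotation
toy_cov <- aggregate(region_size ~ chrom + info, data = toy_bed, FUN = sum)
names(toy_cov)[names(toy_cov) == "region_size"] <- "nbases"
toy_cov
```

In this toy table the chrX record is dropped and the two chromosome 1 records are summed into a single row.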

Supplemental Figure
```{r}
bench_extend_df %>% 
    filter(info == "CALLABLE") %>% 
    left_join(chrom_size_df, by = "chrom") %>% 
    mutate(extend = nbases/non_N) %>% 
    dplyr::select(chrom, nbases, extend) %>% 
    mutate(chrom = str_remove(chrom, "chr"),
           chrom = factor(chrom, levels = 1:22)) %>%  
    ggplot() + geom_bar(aes(x = chrom, y = extend), stat = "identity") + 
    theme_bw() +
    labs(x = "Chromosome", y = "Proportion Extend")
```

Extend is the fraction of non-N bases in the genome that could be added to the benchmark set when the CCS DeepVariant callset with haplotype information is used to generate benchmark regions.
```{r}
bench_extend_df %>% 
    filter(info == "CALLABLE") %>% 
    summarise(nbases = sum(nbases)) %>% 
    mutate(chrom = "genome") %>% 
    left_join(chrom_size_df, by = "chrom") %>% 
    mutate(extend = nbases/non_N)
```
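
The proportion computed above is the number of covered bases divided by the number of non-N reference bases. A toy illustration of that arithmetic follows, assuming a short made-up sequence (`toy_seq`) and a made-up covered-base count. Neither value comes from the analysis.

```r
## Hypothetical reference sequence; N marks unresolved bases
toy_seq <- "NNNNACGTACGTACGTNNNN"
bases <- strsplit(toy_seq, "")[[1]]

len   <- length(bases)       # total length, including N
non_N <- sum(bases != "N")   # bases with a resolved nucleotide

## Hypothetical number of covered bases taken from a BED file
nbases <- 3

## Proportion of non-N bases covered by the extended regions
extend <- nbases / non_N
c(len = len, non_N = non_N, extend = extend)
```

Dividing by `non_N` rather than `len` keeps unresolved reference bases, which no callset can cover, out of the denominator.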

### Number of variants in extended regions
```
bcftools view -R ccs_extended.bed ../kolesnikov/pacbio-15kb-hapsort-wgs.vcf.gz -o deepvar_extend_bcftools.vcf
```

<!-- TODO from bcftools stats, will want to replace with R code -->
```
SN      0       number of samples:      1
SN      0       number of records:      418649
SN      0       number of no-ALTs:      0
SN      0       number of SNPs: 210184
SN      0       number of MNPs: 0
SN      0       number of indels:       208691
SN      0       number of others:       0
SN      0       number of multiallelic sites:   764
SN      0       number of multiallelic SNP sites:       123
```
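
One possible way to bring these summary numbers into R is sketched below. It assumes the `bcftools stats` output is tab-delimited and that the `SN` records have the four fields shown above. The `stats_lines` vector holds three lines copied from that output; in practice the lines would come from `readLines()` on a saved stats file.

```r
## Summary-number (SN) lines copied from the output above, tab-delimited
stats_lines <- c(
    "# SN, Summary numbers:",
    "SN\t0\tnumber of records:\t418649",
    "SN\t0\tnumber of SNPs:\t210184",
    "SN\t0\tnumber of indels:\t208691"
)

## Keep SN records only and split the tab-delimited fields
sn_lines <- grep("^SN\t", stats_lines, value = TRUE)
sn_fields <- strsplit(sn_lines, "\t", fixed = TRUE)

## Third field is the key (trailing colon removed), fourth is the value
sn_df <- data.frame(
    key = sub(":$", "", vapply(sn_fields, `[`, character(1), 3)),
    value = as.integer(vapply(sn_fields, `[`, character(1), 4)),
    stringsAsFactors = FALSE
)
sn_df
```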

## Errors in NIST HG002 v3.3.2
False negatives in LINEs were frequently observed. 
For example, at the LINE at 21:42288851, paired-end short reads provide no evidence for the variant, but CCS and mate-pair reads do (Fig. __ADD REF__).
Additionally, 10x data provide evidence for some other phased SNPs in the region. 
Including PacBio CCS callsets as an additional input for the NIST integration pipeline can potentially correct errors in LINEs.
__TODO estimate for total number of errors in LINEs__
<!-- ![lineIGV]("data/igv_images/igv_snapshot_LINE_21_42288851.png") -->
<!-- ![lineIGVzoom]("data/igv_images/igv_snapshot_LINE_21_42288851_zoomout.png") -->

## Characterizing Errors in CCS
Most variant call errors in CCS callsets were adjacent to homopolymers. 
__TODO__ SNP estimate with UCI-LCI and INDEL estimate with UCI and LCI (Table ccsErrorEst).
__TODO__ relationship between number of AM error, FP, and FN.

```{r}
## Get JZ manual curation results
mc_jz <- gs_title(x = "hg002-ccs-deepvar-curate-JZ - November 19, 2:38 AM")
mc_df <- gs_read(ss = mc_jz)

mc_error <- mc_df %>% 
  group_by(target_cat, var_cat, PacBio) %>% 
  summarise(count = n()) %>% 
  filter(PacBio != "-") %>% 
  spread(PacBio, count, fill = 0) %>% 
  mutate(error_rate = N/(N + Y))

var_counts <- anno_df %>%
  filter(callset == "deepHap") %>% 
  dplyr::select(callset, CHROM, POS, target_cat, var_cat) %>% 
  distinct() %>% 
  group_by(callset, target_cat, var_cat) %>% 
  summarise(count = n()) 
```

<!-- TODO ## A better approach is accounting for uncertainty in error rate as well as number of correct variants -->
```{r ccsErrorEstCalc}
error_est_df <- var_counts %>% 
  left_join(mc_error) %>% 
  filter(var_cat != "other") %>% 
  mutate(bconf = map(Y, Hmisc::binconf, n = 5),
         est = map_dbl(bconf, 1) * count,
         lci = map_dbl(bconf, 2) * count,
         uci = map_dbl(bconf, 3) * count)
```
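
`Hmisc::binconf()` uses the Wilson score interval by default. The sketch below reimplements that interval in base R to show how a proportion estimated from 5 curated variants is scaled to a category count. The `wilson_ci` helper and the category size of 1000 are hypothetical and are not part of the analysis.

```r
## Wilson score interval for x successes out of n trials
## (hypothetical helper; the analysis itself calls Hmisc::binconf())
wilson_ci <- function(x, n, alpha = 0.05) {
    z <- qnorm(1 - alpha / 2)
    p <- x / n
    denom <- 1 + z^2 / n
    center <- (p + z^2 / (2 * n)) / denom
    half <- z * sqrt(p * (1 - p) / n + z^2 / (4 * n^2)) / denom
    c(est = p, lci = center - half, uci = center + half)
}

## Example: 4 of 5 curated variants, in a category of 1000 variants
## (both numbers are invented for illustration)
ci <- wilson_ci(x = 4, n = 5)
round(ci * 1000)
```

With only 5 curated variants per category the interval is wide, so the scaled counts should be read as rough bounds.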

```{r}
## Total number of variants across all categories
sum(error_est_df$count)
## Summed estimate and confidence bounds for non-target, non-AM categories
error_est_df %>% 
  ungroup() %>% 
  filter(target_cat == "non-target",
         !str_detect(var_cat, "AM")) %>% 
  summarise(est = sum(est), lci = sum(lci), uci = sum(uci))
```
Summed estimate for non-target categories, excluding AM: 2434 (summed 95% confidence bounds 1313 - 2611)

```{r ccsErrorEst}
error_est_df %>% 
  ungroup() %>% 
  dplyr::select(-callset, -bconf) %>% 
  knitr::kable(caption = "Estimated number of variant call errors in the DeepVariant callset with haplotype informed model.")
```

<!-- Nate, for the record, below are the commands I used to get the numbers of FPs in each category.  I suspect you could get these numbers pretty easily in your current R script, since essentially I was getting an estimate of the total number of variants in each category you selected 5 variants from for the manual curation.  Rather than reproducing what I did, which was imprecise, it would be great if you’d calculate the # of AM’s, FPs, and FNs in target and non-target for SNPs and indels.  It could also be useful to subset these further by those in map_all and not in map_all. -->

<!-- 1. Number of FP SNPs for homopolymer stratifications -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*SNP' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep omopol | wc -l -->
<!--      414 -->
<!-- ``` -->

<!-- 2. Number of FP INDELS for homopolymer stratifications  -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*INDEL' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep omopol | wc -l -->
<!--     8674 -->
<!-- ``` -->

<!-- 3. Number of FP INDEL in homopolymer and repeat strats defined below -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*INDEL' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep -v omopol | grep -v lowcmp_AllRepeats_lt51bpTRs_gt95identity_merged_slop5 | grep lowcmp_AllRepeats_lt51bp_gt95identity_merged_slop5 | wc -l -->
<!--      328 -->
<!-- ``` -->

<!-- 4. Number of FP SNP in homopolymer and repeat strats defined below -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*SNP' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep -v omopol | grep -v lowcmp_AllRepeats_lt51bpTRs_gt95identity_merged_slop5 | grep lowcmp_AllRepeats_lt51bp_gt95identity_merged_slop5 | wc -l -->
<!--       29 -->
<!-- ``` -->

<!-- 5. Total FP SNPs -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*SNP' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | wc -l -->
<!--     2684 -->
<!-- ``` -->

<!-- 6. Total FP INDELs -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*INDEL' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | wc -l -->
<!--     9822 -->
<!-- ``` -->

<!-- 7. Number of FP SNP in homopolymer or lowcmp_AllRepeats_gt95identity_merged_slop -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*SNP' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep 'omopol\|,lowcmp_AllRepeats_gt95identity_merged_slop5' | wc -l -->
<!--      414 -->
<!-- ``` -->
<!-- 8. Number of FP SNP in homopolymer or lowcmp_AllRepeats_gt95identity_slop -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*SNP' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep 'omopol\|,lowcmp_AllRepeats_gt95identity_slop5' | wc -l -->
<!--      565 -->
<!-- ``` -->

<!-- 9. Number of FP INDEL in homopolymer or lowcmp_AllRepeats_gt95identity_slop -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*INDEL' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep 'omopol\|,lowcmp_AllRepeats_gt95identity_slop5' | wc -l -->
<!--     9627 -->
<!-- ``` -->

<!-- 10. Number of FP SNP not in homopolymer or lowcmp_AllRepeats_gt95identity_slop, and in map_all -->
<!-- ``` -->
<!-- PN105860:triounion_171212 jzook$ zgrep 'FP.*SNP' /Users/jzook/Documents/National\ Institute\ of\ Standards\ and\ Technology\ \(NIST\)/Olson\,\ Nathanael\ David\ \(Fed\)\ -\ giab-hg002-ccs/data/happy_output_deepVarHap/results/result_1.vcf.gz | grep -v 'omopol\|,lowcmp_AllRepeats_gt95identity_slop5' | grep ,map_all | wc -l -->
<!--            2028 -->
<!-- ``` -->


# Conclusions
__TODO__

# Exploratory Analysis

## Precision-Recall Relationship
```{r message = FALSE, fig.cap = "Relationship between precision and recall for all stratifications, excluding different repeat units."}
gg <- ext_trim_df %>% 
    filter(!str_detect(Subset, "unit=")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    filter(Subtype == "*", Filter == "PASS") %>%
    ggplot() + 
    geom_point(aes(x = METRIC.Recall, 
                   y = METRIC.Precision, 
                   alpha = log10(Subset.Size),
                   text = Subset),
               shape = 19, stroke = 0.25) + 
    facet_wrap(query_method~Type, scales = "free") + 
    theme_bw() + 
    labs(x = "Recall", y = "Precision")
plotly::ggplotly(gg)
```


## GC Impact
GATK4 performance is affected by GC content more than that of DeepVariant or DeepVariant with haplotype information. 
Trends in performance relative to GC content are potentially confounded with low complexity regions.  


```{r fig.cap = "Benchmarking metrics by GC content for 100 bp windows with 50 bp slop." }
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"gc")) %>% 
    mutate(Subset = str_remove(Subset, "gc_l100_gc")) %>% 
    mutate(Subset = str_remove(Subset, "_slop50")) %>% 
    mutate(Subset = str_replace(Subset, "or", " or ")) %>%
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "lt", "<")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
    mutate(Subset = fct_relevel(Subset, 
                                c(">85","<25 or >65", "<30 or >55"), 
                                after = Inf)) %>% 
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5,
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90), 
          legend.position = "bottom") +
    labs(x = "% GC for 100 bp windows", fill = "Callset")
```
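
The stratification plots in this part of the report all reshape the `METRIC.*` columns the same way: drop `METRIC.Frac_NA`, convert the remaining metrics to `1 - metric`, and gather them into long format. A minimal base R sketch of that step on a made-up two-row table is shown below. The `toy_metrics` values and callset name are illustrative assumptions; the analysis itself uses the tidyverse pipeline.

```r
## Hypothetical benchmarking metrics; values are invented for illustration
toy_metrics <- data.frame(
    Type = c("SNP", "INDEL"),
    query_method = c("callsetA", "callsetA"),
    METRIC.Recall = c(0.999, 0.95),
    METRIC.Precision = c(0.998, 0.90),
    METRIC.Frac_NA = c(0.01, 0.02),
    stringsAsFactors = FALSE
)

## Metric columns to plot, excluding Frac_NA
metric_cols <- setdiff(grep("^METRIC\\.", names(toy_metrics), value = TRUE),
                       "METRIC.Frac_NA")

## One block of rows per metric, with values converted to 1 - metric
long_list <- lapply(metric_cols, function(col) {
    data.frame(
        Type = toy_metrics$Type,
        query_method = toy_metrics$query_method,
        Metric = paste("1 -", sub("^METRIC\\.", "", col)),
        Value = 1 - toy_metrics[[col]],
        stringsAsFactors = FALSE
    )
})
toy_long <- do.call(rbind, long_list)
toy_long
```

Plotting `1 - metric` puts the differences between callsets on an error-rate scale, which is easier to compare when precision and recall are both close to 1.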

## Difficult Regions

### Low Complexity 

#### Simple Repeats
<!-- __TODO__ Fix order -->

```{r fig.cap = "Benchmarking metrics for different simple repeat stratifications." }
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"lowcmp_SimpleRepeat"),
           !str_detect(Subset,"unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>% 
    mutate(Subset = str_remove(Subset, "lowcmp_SimpleRepeat_")) %>%
    mutate(Subset = str_remove(Subset, "_slop5")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>% 
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90),
          legend.position = "bottom") + 
    labs(x = "Simple Repeat Type", fill = "Callset")
```

#### TRDB
<!-- __TODO__ Fix order   -->

```{r fig.cap = "Benchmarking metrics for TRDB." }
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"Human_Full_Genome"),
           !str_detect(Subset,"unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>%
    mutate(Subset = str_remove(Subset, "lowcmp_Human_Full_Genome_TRDB_hg19_150331_")) %>%
    mutate(Subset = str_remove(Subset, "_gt95identity_merged")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>%
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "lt", " <")) %>%
    mutate(Subset = str_replace(Subset, "gt", " >")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90)) + 
    labs(x = "Tandem Repeat Type", fill = "Callset")
```

#### All Repeats
<!-- __TODO__ Fix order -->

```{r fig.cap = "Benchmarking metrics for all repeat stratifications."}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"AllRepeats"),
           !str_detect(Subset,"unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>%
    mutate(Subset = str_remove(Subset, "lowcmp_AllRepeats_")) %>%
    mutate(Subset = str_remove(Subset, "_gt95identity_merged")) %>%
    mutate(Subset = str_replace(Subset, "_", " ")) %>%
    mutate(Subset = str_replace(Subset, "to", " to ")) %>%
    mutate(Subset = str_replace(Subset, "lt", "<")) %>%
    mutate(Subset = str_replace(Subset, "gt", ">")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90)) + 
    labs(x = "Repeat Type", fill = "Callset")
```


### Segmental Duplications
```{r fig.cap = "Performance metrics for segmental duplication stratifications."}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"segdup")) %>% 
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90)) + 
    labs(x = "Segmental Duplication Category", fill = "Callset")
```

### Mapping
<!-- __TODO__ Fix order and clean-up names   -->

```{r}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(Subset %in% c("map_all", "map_siren", "notinmap_all")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>%
    mutate(Subset = str_remove(Subset, "map_l")) %>%
    mutate(Subset = str_replace_all(Subset, "_", " ")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90)) +
    labs(x = "Mapping Category", fill = "Callset")
```


```{r fig.cap = "Performance metrics for mappability stratifications." }
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(str_detect(Subset,"map"),
           !str_detect(Subset,"unit")) %>% 
    filter(Subset.IS_CONF.Size > 0) %>%
    mutate(Subset = str_remove(Subset, "map_l")) %>%
    mutate(Subset = str_replace_all(Subset, "_", " ")) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90)) +
    labs(x = "Mapping Category", fill = "Callset")
```

## In/ Not-In Comparisons
```{r fig.cap = "Benchmarking metrics for in/not-in stratification comparisons."}
ext_trim_df %>% 
    filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>% 
    filter(Subset %in% c("notinfunc_cds","func_cds",
                         "notinsegdupall", "segdupall",
                         "notinalldifficultregions", "alldifficultregions")) %>% 
    mutate(Subset = fct_relevel(Subset, "func_cds", after = 2)) %>%
    mutate(Subset = fct_relevel(Subset, "notinsegdupall", after = Inf)) %>%
    dplyr::select(Type, Subset, query_method, contains("METRIC"), Subset.Size)  %>% 
    gather("Metric","Value", -Type, -Subset, 
           -query_method, -Subset.Size) %>% 
    mutate(Metric = str_remove(Metric, "METRIC.")) %>% 
    mutate(Metric = if_else(Metric != "Frac_NA", paste("1 -", Metric), Metric),
           Value = if_else(Metric != "Frac_NA", 1 - Value, Value)) %>%
    filter(Metric != "Frac_NA") %>% 
    ggplot() + geom_bar(aes(x = Subset, y = Value, fill = query_method), 
                        color = "grey40", width = 0.5, 
                        position = "dodge", stat = "identity") + 
    facet_grid(Type~Metric, scales = "free") + 
    theme_bw() + 
    theme(axis.text.x = element_text(angle = -90)) + 
    labs(x = "Stratification", fill = "Callset")
```

## Performs well
Stratifications with high precision and recall  
```{r}
ext_trim_df %>% 
    # filter(Subtype == "*") %>% 
    filter(Filter == "PASS") %>%
    filter(!str_detect(Subset, "unit")) %>% 
    filter(METRIC.Recall >= 0.999, 
           METRIC.Precision > 0.98) %>% 
    dplyr::select(query_method, Type, Subtype, Subset, 
           contains("METRIC"), 
           Subset.Size, Subset.IS_CONF.Size)
```


# Callset MD5 Check
MD5 checksums were compared between the VCF files run on the benchmarking app and the VCF files in "final_callsets". 
The MD5 checksums match; therefore, we do not need to rerun the benchmarking app on the files in "final_callsets". 
```{r}
vcf_md5 <- c("b6ce8fbd50b983ea4a3bad5e4d4fc18b  jebler/phased/pacbio_minimap2_15kb_69500_b37_wgs.all.callset_phased.vcf.gz",
"120a15cc95138cd5f1ed107e1e1cd85d  jebler/phased/pacbio_minimap2_15kb_69500_b37_wgs.all.callset_phased.vcf.gz.tbi",
"bbb714c95ca7e6e8369ccfb22e4ae586  jebler/phased/pacbio_minimap2_15kb_69500_b37_wgs.all.retyped_phased.vcf.gz",
"000c980bbc4fb7040fe20947d1762ee3  jebler/phased/pacbio_minimap2_15kb_69500_b37_wgs.all.retyped_phased.vcf.gz.tbi",
"5a049ef09dd22d07b8378668276b4bad  jebler/retyped/pacbio_minimap2_15kb_69500_b37_wgs.all.retyped.vcf.gz",
"f90a2d11cb8135dc951c2c9d5925f816  jebler/retyped/pacbio_minimap2_15kb_69500_b37_wgs.all.retyped.vcf.gz.tbi",
"f756d13a231e660909d16bc69b355059  jebler/retyped/pacbio_pbmm2_15kb_GATK4_hs37d5.all.retyped.vcf.gz",
"f764e0c44772f0db7b241e6c48bbfaba  jebler/retyped/pacbio_pbmm2_15kb_GATK4_hs37d5.all.retyped.vcf.gz.tbi",
"5a049ef09dd22d07b8378668276b4bad  for_jzook/final_callsets/DeepVariant-CCS-WhatsHap.vcf.gz",
"f90a2d11cb8135dc951c2c9d5925f816  for_jzook/final_callsets/DeepVariant-CCS-WhatsHap.vcf.gz.tbi",
"166703dfb3c15b201c5886dde0ace70b  for_jzook/final_callsets/DeepVariant-CCS-hapsort.vcf.gz",
"f868b7e2ecd42d72e6223a9fc9819246  for_jzook/final_callsets/DeepVariant-CCS-hapsort.vcf.gz.tbi",
"a21eb94b7d8ee530244d6d4a28194a59  for_jzook/final_callsets/DeepVariant-CCS.vcf.gz",
"db771e12c949639c0015ad5eea7edf7f  for_jzook/final_callsets/DeepVariant-CCS.vcf.gz.tbi",
"f756d13a231e660909d16bc69b355059  for_jzook/final_callsets/GATKHC-WhatsHap.vcf.gz",
"f764e0c44772f0db7b241e6c48bbfaba  for_jzook/final_callsets/GATKHC-WhatsHap.vcf.gz.tbi",
"0075ff0ca7df131c0b02e426847c28cf  for_jzook/final_callsets/GATKHC.vcf.gz",
"8c1432ec042fd7dc8c14329cd1a37100  for_jzook/final_callsets/GATKHC.vcf.gz.tbi",
"166703dfb3c15b201c5886dde0ace70b  kolesnikov/pacbio-15kb-hapsort-wgs.vcf.gz",
"16577af07101b1105dab9e42d95c71c8  kolesnikov/pacbio-15kb-hapsort-wgs.vcf.gz.csi",
"a21eb94b7d8ee530244d6d4a28194a59  kolesnikov/pacbio_minimap2_15kb_69500_b37_wgs.vcf.gz",
"0075ff0ca7df131c0b02e426847c28cf  wrowell/pacbio_pbmm2_15kb_GATK4_hs37d5.vcf.gz",
"8c1432ec042fd7dc8c14329cd1a37100  wrowell/pacbio_pbmm2_15kb_GATK4_hs37d5.vcf.gz.tbi",
"17fc4f33579bb916a130ebe865b8f128  wrowell/pacbio_pbmm2_15kb_GATK4_hs37d5_whatshap_phased.vcf.gz",
"89c63e379a19cd7616f605bdfb60ca1e  wrowell/pacbio_pbmm2_15kb_GATK4_hs37d5_whatshap_phased.vcf.gz.tbi")

tibble(vcf_md5) %>% separate(vcf_md5, c("md5","vcffile"), 
                             sep = "  ", 
                             remove = TRUE) %>% 
  arrange(md5) %>% 
  ## Keep VCF files only (drop index files) and exclude phased callsets
  filter(str_detect(vcffile, "\\.vcf\\.gz$")) %>% 
  filter(!str_detect(vcffile, "phased"))
```
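
The sorted table above is checked by eye. A small base R sketch of an explicit check follows, grouping file paths by checksum and keeping checksums shared by more than one path. The three input lines are copied from the vector above; the `md5_lines`, `files_by_md5`, and `shared` names are illustrative.

```r
## Three md5sum-style lines copied from the vector above
md5_lines <- c(
    "a21eb94b7d8ee530244d6d4a28194a59  for_jzook/final_callsets/DeepVariant-CCS.vcf.gz",
    "a21eb94b7d8ee530244d6d4a28194a59  kolesnikov/pacbio_minimap2_15kb_69500_b37_wgs.vcf.gz",
    "0075ff0ca7df131c0b02e426847c28cf  wrowell/pacbio_pbmm2_15kb_GATK4_hs37d5.vcf.gz"
)

## Checksum and path are separated by two spaces
md5_df <- data.frame(
    md5 = sub("  .*$", "", md5_lines),
    vcffile = sub("^[0-9a-f]+  ", "", md5_lines),
    stringsAsFactors = FALSE
)

## Group file paths by checksum; a checksum shared by more than one path
## indicates identical files
files_by_md5 <- split(md5_df$vcffile, md5_df$md5)
shared <- files_by_md5[lengths(files_by_md5) > 1]
shared
```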


# Session Information
## System Information
```{r}
sessioninfo::platform_info()
```


## Package Versions
```{r}
sessioninfo::package_info() %>% 
    knitr::kable(booktabs = TRUE)
```