Merge pull request #23 from U-BDS/dev
Various Fixes: Updated documentation and reporting, isoquant downgrading, adding 5 prime support
lianov authored Jul 1, 2024
2 parents 61d8cae + 93c5c05 commit fa1a0dd
Showing 27 changed files with 2,483 additions and 81 deletions.
47 changes: 27 additions & 20 deletions README.md
@@ -38,20 +38,19 @@ On release, automated continuous integration tests run the pipeline on a full-si
6. Extract barcodes. Consists of the following steps:
1. Parse FASTQ files into R1 reads containing barcode and UMI and R2 reads containing sequencing without barcode and UMI (custom script `./bin/pre_extract_barcodes.py`)
2. Re-zip FASTQs ([`pigz`](https://github.com/madler/pigz))
7. Barcode correction (custom script `./bin/correct_barcodes.py`)
8. Post-extraction QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [`NanoPlot`](https://github.com/wdecoster/NanoPlot) and [`ToulligQC`](https://github.com/GenomiqueENS/toulligQC))
9. Alignment ([`minimap2`](https://github.com/lh3/minimap2))
10. SAMtools processing ([`SAMtools`](http://www.htslib.org/doc/samtools.html)), including:
    1. SAM-to-BAM conversion
    2. Filtering for mapped-only reads
    3. Sorting, indexing and obtaining mapping metrics
11. Post-mapping QC on unfiltered BAM files ([`NanoComp`](https://github.com/wdecoster/nanocomp), [`RSeQC`](https://rseqc.sourceforge.net/))
12. Barcode tagging with read quality, BC, BC quality, UMI, and UMI quality (custom script `./bin/tag_barcodes.py`)
13. UMI-based deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
14. Gene- and transcript-level matrix generation ([`IsoQuant`](https://github.com/ablab/IsoQuant))
15. Preliminary matrix QC ([`Seurat`](https://github.com/satijalab/seurat))
16. Compile QC for raw reads, trimmed reads, pre- and post-extracted reads, mapping metrics and preliminary single-cell/nuclei QC ([`MultiQC`](http://multiqc.info/))

## Usage

@@ -100,19 +99,33 @@ This pipeline produces feature barcode matrices at both the gene and transcript

If you experience any issues, please submit an issue on the repository. Resolutions for some common issues are noted below:

- Due to the nature of the data this pipeline analyzes, some tools can experience long runtimes. For some of the custom tools made for this pipeline (`preextract_fastq.py` and `correct_barcodes.py`), we have leveraged the splitting that is done via the `split_amount` param to decrease their overall runtimes. The `split_amount` parameter splits each input FASTQ into multiple files, each containing the number of lines given by the parameter value. As a result, it is important not to set this parameter too low, as that would create a very large number of files for the pipeline to process. While the ideal value is highly dependent on the data, a good starting point for an analysis is `500000` (see the sketch after the configuration examples below). If you find that `PREEXTRACT_FASTQ` and `CORRECT_BARCODES` are still taking a long time to run, it is worth reducing this parameter to `200000` or `100000`, but keeping the value on the order of hundreds of thousands or tens of thousands will help keep the total number of processes manageable.
- One issue that has been observed is a recurrent node failure on SLURM clusters that appears to be related to the submission of Nextflow jobs. This issue stems from Nextflow itself rather than from this pipeline. Our research computing team is currently working on a resolution, but two methods appear to help overcome this issue should it arise:
1. The first is to create a custom config that increases the memory request for the job that failed (see the sketch after the configuration examples below). This may take a couple of attempts to find the correct requests, but we have noted that these errors do occasionally appear to be memory-related.
2. The second is to request an interactive session with a generous amount of time, memory, and CPUs, and run the pipeline on that single node. Note that this will take longer, as there will be minimal parallelization, but it does seem to resolve the issue.
- We acknowledge that analyzing PromethION data is a common use case for this pipeline. Currently, the pipeline defaults are tuned for GridION and average-sized PromethION data. In cases where jobs have failed on larger PromethION datasets, the defaults have been overridden with a custom configuration file (provided via the `-c` Nextflow option) in which resources were increased (substantially in some cases). Below are some of the overrides we have used; while these amounts may not work for every dataset, they should at least indicate which processes may need their resources increased:

```
process
{
    withName: '.*:.*FASTQC.*'
    {
        cpus = 20
    }
}

//NOTE: reminder that params set in modules.config need to be copied over to a custom config
process
{
    withName: '.*:BLAZE'
    {
        cpus = 24
        ext.args = {
            [
                "--threads 30",
                params.barcode_format == "10X_3v3" ? "--kit-version 3v3" : params.barcode_format == "10X_5v2" ? "--kit-version 5v2" : ""
            ].join(' ').trim()
        }
    }
}
```
@@ -136,13 +149,7 @@
```
process
{
    withName: '.*:ISOQUANT'
    {
        cpus = 40
        time = '135.h'
    }
}
```
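
For the `split_amount` parameter discussed above, here is a minimal sketch of how it could be set in a custom config; the value is only the starting point suggested above, not a recommendation for every dataset:

```
//NOTE: sketch only -- tune split_amount to your data
params
{
    split_amount = 500000 // split input FASTQs into files of this many lines each
}
```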
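For the node-failure workaround described in the first resolution above, a minimal sketch of a config that raises the memory request for a failing process is shown below; the process name and amount are placeholders and should be adapted to the job that actually failed:

```
//NOTE: sketch only -- substitute the process that failed and an amount your cluster can satisfy
process
{
    withName: '.*:ISOQUANT'
    {
        memory = '64.GB'
    }
}
```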
15 changes: 15 additions & 0 deletions assets/multiqc_config.yml
@@ -55,13 +55,26 @@ top_modules:
- "*extracted*fastqc*"

custom_content:
read_counts_section:
parent_id: read_counts_section
order:
  - read_counts_module

seurat_section:
parent_id: seurat_section
order:
- transcript_seurat_stats_module
- gene_seurat_stats_module

custom_data:
read_counts_module:
parent_id: read_counts_section
parent_name: "Read Counts Section"
parent_description: "Read counts of samples as they progress through the pipeline"
section_name: "Read Counts"
file_format: "csv"
plot_type: "table"
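# table rows come from read_counts.csv (matched via the sp: patterns below),
# which is assumed to be produced by bin/generate_read_counts.sh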

gene_seurat_stats_module:
parent_id: seurat_section
parent_name: "Seurat Section"
@@ -87,3 +100,5 @@ sp:
fn: "gene.*.tsv"
transcript_seurat_stats_module:
fn: "transcript.*.tsv"
read_counts_module:
fn: "read_counts.csv"
Binary file modified assets/scnanoseq_diagram.png
Binary file added assets/whitelist/737K-august-2016.txt.zip
90 changes: 90 additions & 0 deletions bin/config.py
@@ -0,0 +1,90 @@
# This file stores the parameters used in this repo

import os
import numpy as np

## Output prefix
DEFAULT_PREFIX = ''

####################################################
############ polyT and adaptor finding #############
####################################################
## adaptor finding
ADPT_SEQ='CTTCCGATCT' # adaptor sequence to search for
ADPT_WIN=200 # search for the adaptor within a window of this size at both ends of each read
ADPT_MAC_MATCH_ED=2 # maximum edit distance allowed when matching the adaptor

## format suffix
SEQ_SUFFIX_WIN=200
SEQ_SUFFIX_MIN_MATCH_PROP=1
SEQ_SUFFIX_AFT_ADPT=(20,50)

## polyT searching
PLY_T_LEN=4 # length of the polyT stretch to search for

## TSO searching
TSO_SEQ='TTTCTTATATGGG'

####################################################
####### DEFAULT in getting putative bc ######
####################################################
# input
DEFAULT_GRB_MIN_SCORE=15
DEFAULT_GRB_KIT='v3'
DEFAULT_UMI_SIZE= 12 if DEFAULT_GRB_KIT=='v3' else 10

# The 10X barcode whitelists have been packed in the package
DEFAULT_GRB_WHITELIST_V3=os.path.join(os.path.dirname(__file__), '10X_bc', '3M-february-2018.zip')
DEFAULT_GRB_WHITELIST_V2=os.path.join(os.path.dirname(__file__), '10X_bc', '737K-august-2016.txt')

#output
DEFAULT_GRB_OUT_RAW_BC='putative_bc.csv'
DEFAULT_GRB_OUT_WHITELIST = 'whitelist.csv'
DEFAULT_GRB_OUT_FASTQ = "matched_reads.fastq.gz"
DEFAULT_GRB_FLANKING_SIZE = 5

####################################################
##### DEFAULT in generating whitelist ######
####################################################
# quantile based threshold
def default_count_threshold_calculation(count_array, exp_cells):
    top_count = np.sort(count_array)[::-1][:exp_cells]
    return np.quantile(top_count, 0.95)/20

def high_sensitivity_threshold_calculation(count_array, exp_cells):
    top_count = np.sort(count_array)[::-1][:exp_cells]
    return np.quantile(top_count, 0.95)/200
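
# Illustrative example (not from the repo): for count_array = [900, 850, 5, 3]
# and exp_cells = 2, the top counts are [900, 850], so the default threshold is
# np.quantile([900, 850], 0.95) / 20 = 897.5 / 20 ≈ 44.9, while the
# high-sensitivity threshold is 897.5 / 200 ≈ 4.5.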

# list for empty drops (output in high-sensitivity mode)
DEFAULT_EMPTY_DROP_FN = 'emtpy_bc_list.csv'
DEFAULT_KNEE_PLOT_FN = 'knee_plot.png'
DEFAULT_BC_STAT_FN = "summary.txt"
DEFAULT_EMPTY_DROP_MIN_ED = 5 # minimum edit distance from empty drop BC to selected BC
DEFAULT_EMPTY_DROP_NUM = 2000 # number of BC in the output


####################################################
##### DEFAULT in Demultiplexing ######
####################################################

DEFAULT_ASSIGNMENT_ED = 2
# Make sure this is not larger than DEFAULT_GRB_FLANKING_SIZE
assert DEFAULT_GRB_FLANKING_SIZE >= DEFAULT_ASSIGNMENT_ED
DEFAULT_ED_FLANKING = DEFAULT_ASSIGNMENT_ED



BLAZE_LOGO = \
"""
BBBBBBBBBBBBBBB LLLLL AAAAAA ZZZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBB&&&&&&&BBBB LLLLL AAAAAAAA ZZZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBB^^^^^^!BBBB LLLLL AAAAAAAAAA. ZZZZZ. EEEE
BBBBB BBBB LLLLL AAAA AAAA. ZZZZZ. EEEEEEEEEEEEEEE
BBBBBBBBBBBBBBB LLLLL AAAAAAAAAAAAAA ZZZZZ. EEEEEEEEEEEEEEE
BBBBBBBBBBBBBBB LLLLL AAAAAAAAAAAAAAAA. ZZZZZ. EEEE
BBBBB BBBB LLLLL AAAAAA AAAAA ZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBB BBBB LLLLLAAAAAA AAAAAZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBBBBBBBBBBBBB LLLLLLLLLLLLLLLLLLLL . ^PPPPPPY. ^PPPPPPY. 7PPP
BBBBBBBBBBBBBBB LLLLLLLLLLLLLLLLLLLL...!BBBBBBP:...!BBBBBBP:.::.?BBB
"""

6 changes: 3 additions & 3 deletions bin/correct_barcodes.py
@@ -71,7 +71,7 @@ def parse_args():
help="The minimum posterior probability a barcode on the whitelist must have to replace "
"the barcode detected by the pipeline"
)

parser.set_defaults(print_header=True)

@@ -126,12 +126,12 @@ def correct_barcode(infile, outfile, whitelist, barcode_count_file, min_post_pro
Output: None
"""

with open(infile, 'r') as infile_h, open(outfile, 'w') as outfile_h:

# Turn the whitelist into a trie, as it allows for very quick access to check for a barcode
whitelist_trie = read_whitelist(whitelist)

# Turn the barcode abundances into percentages
bc_probabilities = calculate_bc_ratios(barcode_count_file)

69 changes: 69 additions & 0 deletions bin/find_reads.py
@@ -0,0 +1,69 @@
# Find specific reads with given read IDs and write them to a new FASTQ file
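#
# Usage (illustrative):
#   python find_reads.py <fastq_dir> --id_file read_ids.txt --output_file subset.fastq --threads 4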


import argparse
import textwrap
from pathlib import Path

import Bio.SeqIO

import helper

def parse_arg():
    parser = argparse.ArgumentParser(
        description=textwrap.dedent(
            '''
            Find specific reads with given read id and output to a new fastq file
            '''))

    # Required positional argument
    parser.add_argument('input_fastq_dir', type=str,
                        help='FASTQ directory. Note that this should be a folder.')

    # Required named arguments
    requiredNamed = parser.add_argument_group('These arguments are required')
    requiredNamed.add_argument(
        '--output_file', type=str, required=True,
        help='Filename for the output fastq.')
    requiredNamed.add_argument('--id_file', type=str, required=True,
                               help='A file containing all the read ids to look for.')

    parser.add_argument('--threads', type=int,
                        help='Number of threads. Default: # of CPU - 1')

    args = parser.parse_args()

    return args

def find_reads(in_fastq, id_list):
    """Return the reads in in_fastq whose IDs appear in id_list."""
    fastq = Bio.SeqIO.parse(in_fastq, "fastq")
    read_list = [r for r in fastq if r.id in id_list]
    return read_list

def main(args):

    # Get IDs from args.id_file; a set makes the per-read membership check O(1)
    ids = set()
    with open(args.id_file, 'r') as f:
        for line in f:
            ids.add(line.strip())

    fastq_fns = list(Path(args.input_fastq_dir).rglob('*.fastq'))
    rst_futures = helper.multiprocessing_submit(find_reads,
                                                fastq_fns,
                                                n_process=args.threads,
                                                id_list=ids)

    rst_ls = []
    for f in rst_futures:
        rst_ls += f.result()

    print(f"Found {len(rst_ls)} matching reads")
    Bio.SeqIO.write(rst_ls, args.output_file, 'fastq')

if __name__ == '__main__':
    args = parse_arg()
    main(args)
82 changes: 82 additions & 0 deletions bin/generate_read_counts.sh
@@ -0,0 +1,82 @@
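#!/bin/bash
# Usage (illustrative; flag names come from the option parser below):
#   generate_read_counts.sh --input <dir with FastQC zips and *.corrected_bc_umi.tsv> --output read_counts.csv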

# Extract the 'Total Sequences' value from a FastQC zip (unzip -p streams fastqc_data.txt without unpacking)
get_fastqc_counts()
{
fastqc_file=$1
counts=$(unzip -p ${fastqc_file} $(basename ${fastqc_file} .zip)/fastqc_data.txt | \
grep 'Total Sequences' | \
cut -f2 -d$'\t')
echo $counts

}

output=""
input=""

while [[ $# -gt 0 ]]
do
flag=$1

case "${flag}" in
--input) input=$2; shift;;
--output) output=$2; shift;;
*) echo "Unknown option $1" && exit 1
esac
shift
done

header=""
data=""

header="sample,base_fastq_counts,trimmed_read_counts,extracted_read_counts,corrected_read_counts"
echo "$header" > $output

# Derive unique sample names from the FastQC zips (strip everything after the first '.')
for sample_name in $(for file in $(readlink -f $input)/*.zip; do echo $file; done | cut -f1 -d'.' | sort -u)
do
raw_fastqc="${sample_name}.raw_fastqc.zip"
trim_fastqc="${sample_name}.trimmed_fastqc.zip"
extract_fastqc="${sample_name}.extracted_fastqc.zip"
correct_csv="${sample_name}.corrected_bc_umi.tsv"
data="$(basename $sample_name)"

# RAW FASTQ COUNTS

if [[ -s "$raw_fastqc" ]]
then
fastqc_counts=$(get_fastqc_counts "$raw_fastqc")
data="$data,$fastqc_counts"
else
data="$data,"
fi

# TRIM COUNTS

if [[ -s "$trim_fastqc" ]]
then
trim_counts=$(get_fastqc_counts "$trim_fastqc")
data="$data,$trim_counts"
else
data="$data,"
fi

# PREEXTRACT COUNTS

if [ -s "$extract_fastqc" ]
then
extract_counts=$(get_fastqc_counts "$extract_fastqc")
data="$data,$extract_counts"
else
data="$data,"
fi

# CORRECT COUNTS


if [ -s $correct_csv ]
then
correct_counts=$(cut -f6 $correct_csv | awk '{if ($0 != "") {print $0}}' | wc -l)
data="$data,$correct_counts"
else
data="$data,"
fi
echo "$data" >> $output
done