Merge pull request #23 from U-BDS/dev
Various Fixes: Updated documentation and reporting, isoquant downgrading, adding 5 prime support
lianov authored Jul 1, 2024
2 parents 61d8cae + 93c5c05 commit fa1a0dd
Showing 27 changed files with 2,483 additions and 81 deletions.
47 changes: 27 additions & 20 deletions README.md
@@ -38,20 +38,19 @@ On release, automated continuous integration tests run the pipeline on a full-si
6. Extract barcodes. Consists of the following steps:
1. Parse FASTQ files into R1 reads containing barcode and UMI and R2 reads containing sequencing without barcode and UMI (custom script `./bin/pre_extract_barcodes.py`)
2. Re-zip FASTQs ([`pigz`](https://github.com/madler/pigz))
7. Barcode correction (custom script `./bin/correct_barcodes.py`)
8. Post-extraction QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [`NanoPlot`](https://github.com/wdecoster/NanoPlot) and [`ToulligQC`](https://github.com/GenomiqueENS/toulligQC))
9. Alignment ([`minimap2`](https://github.com/lh3/minimap2))
10. SAMtools processing ([`SAMtools`](http://www.htslib.org/doc/samtools.html)), including:
    1. SAM-to-BAM conversion
    2. Filtering for mapped-only reads
    3. Sorting, indexing and obtaining mapping metrics
11. Post-mapping QC on unfiltered BAM files ([`NanoComp`](https://github.com/wdecoster/nanocomp), [`RSeQC`](https://rseqc.sourceforge.net/))
12. Barcode tagging with read quality, BC, BC quality, UMI, and UMI quality (custom script `./bin/tag_barcodes.py`)
13. UMI-based deduplication ([`UMI-tools`](https://github.com/CGATOxford/UMI-tools))
14. Gene- and transcript-level matrix generation ([`IsoQuant`](https://github.com/ablab/IsoQuant))
15. Preliminary matrix QC ([`Seurat`](https://github.com/satijalab/seurat))
16. Compile QC for raw reads, trimmed reads, pre- and post-extracted reads, mapping metrics and preliminary single-cell/nuclei QC ([`MultiQC`](http://multiqc.info/))

## Usage

@@ -100,19 +99,33 @@ This pipeline produces feature barcode matrices at both the gene and transcript

If you experience any issues, please submit an issue on the repository. Resolutions for some common issues are noted below:

- Due to the nature of the data this pipeline analyzes, some tools can experience long runtimes. For some of the custom tools made for this pipeline (`preextract_fastq.py` and `correct_barcodes.py`), we have leveraged the splitting that is done via the `split_amount` param to decrease their overall runtimes. The `split_amount` parameter splits each input FASTQ into multiple files, each containing the number of lines given by the parameter value. As a result, it is important not to set this parameter too low, as that would create a very large number of files for the pipeline to process. While the ideal value is highly dependent on the data, a good starting point for an analysis is `500000` (see the sketch after the configuration examples below). If you find that `PREEXTRACT_FASTQ` and `CORRECT_BARCODES` are still taking a long time to run, it is worth reducing this parameter to `200000` or `100000`, but keeping the value on the order of hundreds of thousands or tens of thousands will help keep the total number of processes manageable.
- One issue that has been observed is a recurrent node failure on SLURM clusters that appears to be related to the submission of Nextflow jobs. This issue stems from Nextflow itself rather than from this pipeline. Our research computing team is currently working on a resolution, but two methods appear to help overcome this issue should it arise:
1. The first is to create a custom config that increases the memory request for the job that failed (see the sketch after the configuration examples below). This may take a couple of attempts to find the correct requests, but we have noted that these errors do occasionally appear to be memory-related.
2. The second is to request an interactive session with a generous amount of time, memory, and CPUs, and run the pipeline on that single node. Note that this will take longer, as there will be minimal parallelization, but it does seem to resolve the issue.
- We acknowledge that analyzing PromethION data is a common use case for this pipeline. Currently, the pipeline defaults are tuned for GridION and average-sized PromethION data. In cases where jobs have failed on larger PromethION datasets, the defaults have been overridden with a custom configuration file (provided via the `-c` Nextflow option) in which resources were increased (substantially in some cases). Below are some of the overrides we have used; while these amounts may not work for every dataset, they should at least indicate which processes may need their resources increased:

```
process
{
    withName: '.*:.*FASTQC.*'
    {
        cpus = 20
    }
}

//NOTE: reminder that params set in modules.config need to be copied over to a custom config
process
{
    withName: '.*:BLAZE'
    {
        cpus = 24
        ext.args = {
            [
                "--threads 30",
                params.barcode_format == "10X_3v3" ? "--kit-version 3v3" : params.barcode_format == "10X_5v2" ? "--kit-version 5v2" : ""
            ].join(' ').trim()
        }
    }
}
```
@@ -136,13 +149,7 @@
```
process
{
    withName: '.*:ISOQUANT'
    {
        cpus = 40
        time = '135.h'
    }
}
```
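
For the `split_amount` parameter discussed above, here is a minimal sketch of how it could be set in a custom config; the value is only the starting point suggested above, not a recommendation for every dataset:

```
//NOTE: sketch only -- tune split_amount to your data
params
{
    split_amount = 500000 // split input FASTQs into files of this many lines each
}
```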
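For the node-failure workaround described in the first resolution above, a minimal sketch of a config that raises the memory request for a failing process is shown below; the process name and amount are placeholders and should be adapted to the job that actually failed:

```
//NOTE: sketch only -- substitute the process that failed and an amount your cluster can satisfy
process
{
    withName: '.*:ISOQUANT'
    {
        memory = '64.GB'
    }
}
```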
15 changes: 15 additions & 0 deletions assets/multiqc_config.yml
@@ -55,13 +55,26 @@ top_modules:
- "*extracted*fastqc*"

custom_content:
read_counts_section:
parent_id: read_counts_section
order:
  - read_counts_module

seurat_section:
parent_id: seurat_section
order:
- transcript_seurat_stats_module
- gene_seurat_stats_module

custom_data:
read_counts_module:
parent_id: read_counts_section
parent_name: "Read Counts Section"
parent_description: "Read counts of samples as they progress through the pipeline"
section_name: "Read Counts"
file_format: "csv"
plot_type: "table"
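# table rows come from read_counts.csv (matched via the sp: patterns below),
# which is assumed to be produced by bin/generate_read_counts.sh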

gene_seurat_stats_module:
parent_id: seurat_section
parent_name: "Seurat Section"
@@ -87,3 +100,5 @@ sp:
fn: "gene.*.tsv"
transcript_seurat_stats_module:
fn: "transcript.*.tsv"
read_counts_module:
fn: "read_counts.csv"
Binary file modified assets/scnanoseq_diagram.png
Binary file added assets/whitelist/737K-august-2016.txt.zip
90 changes: 90 additions & 0 deletions bin/config.py
@@ -0,0 +1,90 @@
# This file stores the parameters used in this repo

import os
import numpy as np

## Output prefix
DEFAULT_PREFIX = ''

####################################################
############ polyT and adaptor finding #############
####################################################
## adaptor finding
ADPT_SEQ='CTTCCGATCT' # adaptor sequence to search for
ADPT_WIN=200 # search for the adaptor within a window of this size at both ends of each read
ADPT_MAC_MATCH_ED=2 # maximum edit distance allowed when matching the adaptor

## format suffix
SEQ_SUFFIX_WIN=200
SEQ_SUFFIX_MIN_MATCH_PROP=1
SEQ_SUFFIX_AFT_ADPT=(20,50)

## polyT searching
PLY_T_LEN=4 # length of the polyT stretch to search for

## TSO searching
TSO_SEQ='TTTCTTATATGGG'

####################################################
####### DEFAULT in getting putative bc ######
####################################################
# input
DEFAULT_GRB_MIN_SCORE=15
DEFAULT_GRB_KIT='v3'
DEFAULT_UMI_SIZE= 12 if DEFAULT_GRB_KIT=='v3' else 10

# The 10X barcode whitelists have been packed in the package
DEFAULT_GRB_WHITELIST_V3=os.path.join(os.path.dirname(__file__), '10X_bc', '3M-february-2018.zip')
DEFAULT_GRB_WHITELIST_V2=os.path.join(os.path.dirname(__file__), '10X_bc', '737K-august-2016.txt')

#output
DEFAULT_GRB_OUT_RAW_BC='putative_bc.csv'
DEFAULT_GRB_OUT_WHITELIST = 'whitelist.csv'
DEFAULT_GRB_OUT_FASTQ = "matched_reads.fastq.gz"
DEFAULT_GRB_FLANKING_SIZE = 5

####################################################
##### DEFAULT in generating whitelist ######
####################################################
# quantile based threshold
def default_count_threshold_calculation(count_array, exp_cells):
    top_count = np.sort(count_array)[::-1][:exp_cells]
    return np.quantile(top_count, 0.95)/20

def high_sensitivity_threshold_calculation(count_array, exp_cells):
    top_count = np.sort(count_array)[::-1][:exp_cells]
    return np.quantile(top_count, 0.95)/200
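
# Illustrative example (not from the repo): for count_array = [900, 850, 5, 3]
# and exp_cells = 2, the top counts are [900, 850], so the default threshold is
# np.quantile([900, 850], 0.95) / 20 = 897.5 / 20 ≈ 44.9, while the
# high-sensitivity threshold is 897.5 / 200 ≈ 4.5.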

# list for empty drops (output in high-sensitivity mode)
DEFAULT_EMPTY_DROP_FN = 'emtpy_bc_list.csv'
DEFAULT_KNEE_PLOT_FN = 'knee_plot.png'
DEFAULT_BC_STAT_FN = "summary.txt"
DEFAULT_EMPTY_DROP_MIN_ED = 5 # minimum edit distance from empty drop BC to selected BC
DEFAULT_EMPTY_DROP_NUM = 2000 # number of BC in the output


####################################################
##### DEFAULT in Demultiplexing ######
####################################################

DEFAULT_ASSIGNMENT_ED = 2
# Make sure this is not larger than DEFAULT_GRB_FLANKING_SIZE
assert DEFAULT_GRB_FLANKING_SIZE >= DEFAULT_ASSIGNMENT_ED
DEFAULT_ED_FLANKING = DEFAULT_ASSIGNMENT_ED



BLAZE_LOGO = \
"""
BBBBBBBBBBBBBBB LLLLL AAAAAA ZZZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBB&&&&&&&BBBB LLLLL AAAAAAAA ZZZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBB^^^^^^!BBBB LLLLL AAAAAAAAAA. ZZZZZ. EEEE
BBBBB BBBB LLLLL AAAA AAAA. ZZZZZ. EEEEEEEEEEEEEEE
BBBBBBBBBBBBBBB LLLLL AAAAAAAAAAAAAA ZZZZZ. EEEEEEEEEEEEEEE
BBBBBBBBBBBBBBB LLLLL AAAAAAAAAAAAAAAA. ZZZZZ. EEEE
BBBBB BBBB LLLLL AAAAAA AAAAA ZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBB BBBB LLLLLAAAAAA AAAAAZZZZZZZZZZZZEEEEEEEEEEEEEEE
BBBBBBBBBBBBBBBB LLLLLLLLLLLLLLLLLLLL . ^PPPPPPY. ^PPPPPPY. 7PPP
BBBBBBBBBBBBBBB LLLLLLLLLLLLLLLLLLLL...!BBBBBBP:...!BBBBBBP:.::.?BBB
"""

6 changes: 3 additions & 3 deletions bin/correct_barcodes.py
@@ -71,7 +71,7 @@ def parse_args():
help="The minimum posterior probability a barcode on the whitelist must have to replace "
"the barcode detected by the pipeline"
)

parser.set_defaults(print_header=True)

@@ -126,12 +126,12 @@ def correct_barcode(infile, outfile, whitelist, barcode_count_file, min_post_pro
Output: None
"""

with open(infile, 'r') as infile_h, open(outfile, 'w') as outfile_h:

# Turn the whitelist into a trie, as it allows for very quick access to check for a barcode
whitelist_trie = read_whitelist(whitelist)

# Turn the barcode abundances into percentages
bc_probabilities = calculate_bc_ratios(barcode_count_file)

69 changes: 69 additions & 0 deletions bin/find_reads.py
@@ -0,0 +1,69 @@
# Find specific reads with given read IDs and write them to a new FASTQ file
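#
# Usage (illustrative):
#   python find_reads.py <fastq_dir> --id_file read_ids.txt --output_file subset.fastq --threads 4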


import argparse
import textwrap
from pathlib import Path

import Bio.SeqIO

import helper

def parse_arg():
    parser = argparse.ArgumentParser(
        description=textwrap.dedent(
            '''
            Find specific reads with given read id and output to a new fastq file
            '''))

    # Required positional argument
    parser.add_argument('input_fastq_dir', type=str,
                        help='FASTQ directory. Note that this should be a folder.')

    # Required named arguments
    requiredNamed = parser.add_argument_group('These arguments are required')
    requiredNamed.add_argument(
        '--output_file', type=str, required=True,
        help='Filename for the output fastq.')
    requiredNamed.add_argument('--id_file', type=str, required=True,
                               help='A file containing all the read ids to look for.')

    parser.add_argument('--threads', type=int,
                        help='Number of threads. Default: # of CPU - 1')

    args = parser.parse_args()

    return args

def find_reads(in_fastq, id_list):
    """Return the reads in in_fastq whose IDs appear in id_list."""
    fastq = Bio.SeqIO.parse(in_fastq, "fastq")
    read_list = [r for r in fastq if r.id in id_list]
    return read_list

def main(args):

    # Get IDs from args.id_file; a set makes the per-read membership check O(1)
    ids = set()
    with open(args.id_file, 'r') as f:
        for line in f:
            ids.add(line.strip())

    fastq_fns = list(Path(args.input_fastq_dir).rglob('*.fastq'))
    rst_futures = helper.multiprocessing_submit(find_reads,
                                                fastq_fns,
                                                n_process=args.threads,
                                                id_list=ids)

    rst_ls = []
    for f in rst_futures:
        rst_ls += f.result()

    print(f"Found {len(rst_ls)} matching reads")
    Bio.SeqIO.write(rst_ls, args.output_file, 'fastq')

if __name__ == '__main__':
    args = parse_arg()
    main(args)
82 changes: 82 additions & 0 deletions bin/generate_read_counts.sh
@@ -0,0 +1,82 @@
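#!/bin/bash
# Usage (illustrative; flag names come from the option parser below):
#   generate_read_counts.sh --input <dir with FastQC zips and *.corrected_bc_umi.tsv> --output read_counts.csv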

# Extract the 'Total Sequences' value from a FastQC zip (unzip -p streams fastqc_data.txt without unpacking)
get_fastqc_counts()
{
fastqc_file=$1
counts=$(unzip -p ${fastqc_file} $(basename ${fastqc_file} .zip)/fastqc_data.txt | \
grep 'Total Sequences' | \
cut -f2 -d$'\t')
echo $counts

}

output=""
input=""

while [[ $# -gt 0 ]]
do
flag=$1

case "${flag}" in
--input) input=$2; shift;;
--output) output=$2; shift;;
*) echo "Unknown option $1" && exit 1
esac
shift
done

header=""
data=""

header="sample,base_fastq_counts,trimmed_read_counts,extracted_read_counts,corrected_read_counts"
echo "$header" > $output

# Derive unique sample names from the FastQC zips (strip everything after the first '.')
for sample_name in $(for file in $(readlink -f $input)/*.zip; do echo $file; done | cut -f1 -d'.' | sort -u)
do
raw_fastqc="${sample_name}.raw_fastqc.zip"
trim_fastqc="${sample_name}.trimmed_fastqc.zip"
extract_fastqc="${sample_name}.extracted_fastqc.zip"
correct_csv="${sample_name}.corrected_bc_umi.tsv"
data="$(basename $sample_name)"

# RAW FASTQ COUNTS

if [[ -s "$raw_fastqc" ]]
then
fastqc_counts=$(get_fastqc_counts "$raw_fastqc")
data="$data,$fastqc_counts"
else
data="$data,"
fi

# TRIM COUNTS

if [[ -s "$trim_fastqc" ]]
then
trim_counts=$(get_fastqc_counts "$trim_fastqc")
data="$data,$trim_counts"
else
data="$data,"
fi

# PREEXTRACT COUNTS

if [ -s "$extract_fastqc" ]
then
extract_counts=$(get_fastqc_counts "$extract_fastqc")
data="$data,$extract_counts"
else
data="$data,"
fi

# CORRECT COUNTS


if [ -s $correct_csv ]
then
correct_counts=$(cut -f6 $correct_csv | awk '{if ($0 != "") {print $0}}' | wc -l)
data="$data,$correct_counts"
else
data="$data,"
fi
echo "$data" >> $output
done