A nextflow implementation of the microbial variant calling pipeline RedDog. The central goal of RedDog is to identify high-quality SNPs present in a set of isolates using a mapping-based approach.

For the RedDog veterans:
# Clone the reddog-nf repository
git clone https://github.com/katholt/reddog-nf.git && cd reddog-nf
# Install dependencies and activate conda environment
conda create -c bioconda -c conda-forge -p $(pwd -P)/conda_env --yes --file config/conda_dependencies.txt
conda activate $(pwd -P)/conda_env
# Set input reads, reference, and output directory in configuration file
vim nextflow.config
# Launch pipeline
./reddog.nf
To run the pipeline on M3 there are some additional steps to take and some important considerations. Software provisioning is similar to the general example above, but you must first load miniconda into your path:
# Clone the reddog-nf repository
git clone https://github.com/katholt/reddog-nf.git && cd reddog-nf
# Load the miniconda3 module and install dependencies
module load miniconda3/4.1.11-python3.5
conda create -c bioconda -c conda-forge -p $(pwd -P)/conda_env --yes --file config/conda_dependencies.txt
Note that this is the only version of miniconda that will work; other versions are currently broken and will not be fixed for M3 users.
RedDog should be run in a screen session so that it will persist between ssh sessions:
# Create and attach to a screen session, load conda environment
screen -R reddog_example
conda activate $(pwd -P)/conda_env
Configuration of input reads, reference, and output directory is done as usual. Once the pipeline has been configured you can execute the run; however, the -profile argument must be provided explicitly. This is to prevent inadvertently running pipeline processes on the head node.
# Set input reads, reference, and output directory in configuration file
vim nextflow.config
# Launch pipeline with a specific execution profile
./reddog.nf -profile massive
Lastly, the ability to resume pipeline execution is built into nextflow. To resume a run, the -resume argument is used. Note that if the output directory already exists, you must confirm that RedDog can overwrite its contents by specifying --force.
# Reattach to the screen session and resume pipeline execution
screen -R reddog_example
./reddog.nf -profile massive -resume --force
Running RedDog is a two-step process involving configuration and then execution. The nextflow.config file controls pipeline configuration, and execution is done by issuing ./reddog.nf. Further detail regarding configuration and execution is given below.
Running RedDog requires a set of short reads in FASTQ format, a reference assembly in GenBank format, and an output directory. These are set in the Input and output section of the nextflow.config file, for example:
// Input and output
reads = 'reads/*.fastq.gz'
reference = 'data/reference.gbk'
output_dir = 'output/'
In the above snippet, the input reads are specified with a glob containing the star wildcard character. This expression selects all files in the reads/ directory that have a suffix of .fastq.gz. If your input files are present in more than one directory, a list of globs can be provided:
reads = ['read_set_1/*.fastq.gz', 'read_set_2/*fastq.gz']
The nextflow implementation of RedDog also handles input containing both paired-end and single-end read sets.
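For example, paired-end and single-end read sets could be supplied together as in the sketch below; the directory names are illustrative, and it is assumed here that the pipeline infers read pairing from the file names:

// Input and output (directory names are illustrative)
reads = ['paired_reads/*.fastq.gz', 'single_reads/*.fastq.gz']
reference = 'data/reference.gbk'
output_dir = 'output/'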
General
Option | Description | Default |
---|---|---|
bt2_max_frag_len | Maximum fragment length that Bowtie 2 will consider | 200 |
bt2_mode | Bowtie 2 preset | sensitive-local |
var_depth_min | Minimum read depth for vcfutils.pl varFilter to call variants | 5 |
mapping_cover_min | Minimum coverage to pass replicon | 50 |
mapping_depth_min | Minimum depth to pass replicon | 10 |
mapping_mapped_min | Minimum percentage mapped to pass replicon, largest replicon only | 50 |
outgroup_mod | Modifier for ingroup/outgroup designation | 2 |
allele_matrix_support | Minimum ratio of read support to call a SNP in the allele matrix | 90 |
allele_matrix_cons | Minimum % of isolates containing a SNP for it to pass filtering | 95 |
HPC
Option | Description | Default |
---|---|---|
max_retries | Number of times to resubmit a job, usually with more resources | 3 |
queue_size | Number of jobs to submit to the job scheduler | 1000 |
slurm_account | SLURM account used for job submission | js66 |
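Assuming these options are exposed in nextflow.config alongside the input and output settings, overriding a default is a matter of editing the relevant line. A minimal sketch (the values shown are illustrative, not recommendations):

// Illustrative overrides; see nextflow.config for the full option list
bt2_max_frag_len = 500
var_depth_min = 10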
A merge run allows you to add additional isolates to an existing RedDog output. This is preferable to running an entirely new RedDog analysis, as a merge run does not require existing isolate data to be remapped, decreasing runtime considerably. To execute a merge run you first need to enable merge mode and provide the existing RedDog output directory in the Merge run section of nextflow.config, for example:
// Merge run settings
merge_run = true
existing_run_dir = 'existing_output/'
Following this, the new isolate read sets and output directory must be set in the Input and output configuration section:
// Input and output
reads = 'new_reads/*.fastq.gz'
reference = 'data/reference.gbk'
output_dir = 'merged_output/'
It is critical that the same reference and filtering/quality settings are used in both the previous and new runs. A series of preflight checks compares the settings and supplied reference between runs. If any inconsistency is detected it will be reported and the pipeline will abort. These inconsistencies can be ignored by setting merge_ignore_errors = true.
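A minimal sketch of doing so, assuming the option sits alongside the other merge settings (only use this if you are confident the reported differences are harmless):

// Merge run settings
merge_run = true
existing_run_dir = 'existing_output/'
merge_ignore_errors = true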
At the conclusion of a merge run the output directory (merged_output/ in the above snippet) will contain all data from the previous run (from existing_output/) and the data generated from the new read sets. Once the contents of the new output files have been checked, the old output directory can be deleted.
It is recommended that RedDog be launched inside a persistent terminal session such as screen, as the pipeline may run over several hours or days. In a normal setting without a persistent terminal, closing the terminal or ssh session running RedDog would cause the pipeline to also quit. Executing RedDog in a screen session allows you to close the terminal or ssh session and have the pipeline continue in the background. Here is a brief example:
# Open a new screen session and launch the pipeline
screen -R reddog_run
./reddog.nf
# Detach from the screen session with C-a d (ctrl+a, then d)
# List active screen sessions
screen -ls
# Reattach to the screen
screen -R reddog_run
Pipeline resume is available through the nextflow framework and is enabled by specifying -resume on the command line. To avoid potentially overwriting existing analysis, RedDog will refuse to write to an output directory that already exists unless you explicitly allow it to do so with --force. To resume a run you must provide both arguments:
./reddog.nf -resume --force
The nextflow framework stores intermediate files in a directory called work/. This directory can become very large and should be deleted once the RedDog run has completed.
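For example, once the run has finished and you have checked the outputs:

# Remove nextflow intermediate files after the run has completed
rm -rf work/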
When running RedDog on M3 you must explicitly set the executor. In the nextflow framework the executor manages how jobs are run, and the default standard executor launches jobs locally. However, on M3 jobs should never be run locally and should instead be submitted to the SLURM queue. This is done in RedDog by selecting the massive profile:
./reddog.nf -profile massive
The pipeline is configured to dynamically select the optimal partition and QoS for each submitted job. Selection is conditioned on the walltime of each job:
Walltime | QoS | Partition |
---|---|---|
0-30 minutes | genomics | comp, genomics, short |
31-240 minutes | genomics | comp, genomics |
241+ minutes | normal | comp |
Additionally, the resources requested for each task are defined in config/slurm_job.config. By default, when a job fails it is resubmitted two more times but with increased resources. Where a job fails three times due to insufficient resources, you can manually specify the required resources in config/slurm_job.config.
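A minimal sketch of what such a manual override in config/slurm_job.config might look like; the process name and resource values here are purely illustrative, so check the file itself for the actual process names and existing settings:

process {
    // Hypothetical process name; replace with the task that is repeatedly failing
    withName: align_reads_pe {
        memory = '16 GB'
        time = '8h'
    }
}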
The genomics queue limits the number of concurrently queued or running jobs, and once this limit is reached any further job that is submitted is rejected. To avoid job rejection in this context, the pipeline will only submit 200 tasks to the queue at any one time. This can be adjusted through the queue_size option in nextflow.config.
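For example, a sketch of lowering the limit further (the value shown is illustrative):

// Maximum number of jobs submitted to the scheduler at any one time
queue_size = 100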
For each replicon in the provided reference, the following outputs are generated:
File | Name |
---|---|
Mapping statistics | <reference-name>_<replicon>_mapping_stats.tsv |
SNP alleles | <reference-name>_<replicon>_alleles.csv |
Conserved SNP alleles | <reference-name>_<replicon>_alleles_cons0.95.csv |
Alignment of conserved SNPs | <reference-name>_<replicon>_cons0.95.mfasta |
Phylogeny of conserved SNPs | <reference-name>_<replicon>_cons0.95.tree |
Consequences of conserved SNPs | <reference-name>_<replicon>_consequences_cons0.95.tsv |
Gene coverage | <reference-name>_gene_coverage.csv |
Gene depth | <reference-name>_gene_depth.csv |
Additionally several directory outputs are created:
Directory | Location and name |
---|---|
Filtered isolate BAMs | <output_dir>/bams/ |
Filtered isolate VCFs | <output_dir>/vcfs/ |
Run information | <output_dir>/run_info/ |
Read quality reports (optional) | <output_dir>/fastqc/ |
Columns in the mapping statistics file:

Column name | Description |
---|---|
isolate | Isolate name |
replicon_coverage | Coverage of replicon for positions >=1 depth |
replicon_average_depth | Average depth for positions >=1 depth |
replicon_reads_mapped | Proportion of all mapped reads belonging to this replicon |
total_mapped_reads | Proportion of reads mapped to any replicon |
total_reads | Count of total mapped reads |
snps | Number of homozygous SNPs |
snps_heterozygous | Number of heterozygous SNPs |
indels | Number of indels |
pass_fail | Pass or fail classification based on mapping statistics |
phylogeny_group | Ingroup or outgroup designation |
Columns in the SNP allele matrices:

Column name | Description |
---|---|
Pos | Nucleotide position in reference |
Reference | Reference nucleotide (+ve strand) |
isolate columns | One column per isolate |
Columns in the conserved SNP consequences file:

Column name | Description |
---|---|
position | Nucleotide position in reference |
ref | Reference nucleotide (+ve strand) |
alt | Alternative nucleotide (+ve strand) |
strand | Encoding strand of gene |
change_type | Mutation type |
gene | Gene name |
ref_codon | Reference codon |
alt_codon | Alternative codon |
ref_aa | Reference amino acid |
alt_aa | Alternative amino acid |
gene_product | Description of encoded protein |
gene_nucleotide_position | Nucleotide position of mutation in gene |
gene_codon_position | Position of affected codon in gene |
codon_nucleotide_position | Nucleotide position of mutation in codon |
notes | Misc. notes |
Columns in the gene coverage and gene depth files:

Column name | Description |
---|---|
replicon | Name of replicon |
locus_tag | Gene locus tag |
isolate columns | One column per isolate |
The following software is required:

- bcftools, version ≥1.9
- biopython, version ≥1.76
- bowtie2, version ≥2.4.1
- fastqc, version ≥0.11.9
- fasttree, version ≥2.1.10
- multiqc, version ≥1.7
- nextflow, version ≥20.10.0
- openjdk, version ≥8.0
- samtools, version ≥1.9
- seqtk, version ≥1.3
The easiest way to satisfy these dependencies is to use conda and the list of dependencies in config/conda_dependencies.txt:
# First ensure you're in the top level directory of the git repo
conda create -c bioconda -c conda-forge -p $(pwd -P)/conda_env --yes --file config/conda_dependencies.txt
conda activate $(pwd -P)/conda_env
I've written both end-to-end tests and unit/functional tests. These are designed to detect regressions, test the effect of modifications, and check the compatibility of installed software. To run these tests, first provision requirements as shown in Requirements.
# Unit tests and functional tests, run from the git repo top level directory
python3 -m unittest
# End-to-end test
cd tests/simulated_data/
mkdir -p reads/
# Create reference and simulate reads with known mutations
./scripts/generate_reference.py > data/reference.gbk
./scripts/simulate_reads.py --reference_fp data/reference.gbk --spec_fp data/dataset_specification.tsv --output_dir reads/
# Run the pipeline on the simulated data
../../reddog.nf
# Check outputs
./scripts/compare_data.py --run_dir output/ --spec_fp data/dataset_specification.tsv --test_data_dir data/run_data/