This repository is intended to be used as a supplement to a scholarly publication.
As such, the repository is made available “as-is”; maintenance and support are not available at this time.
This code was used to generate the data for the publication “MuA-based Molecular Indexing for Rare Mutation Detection by Next-Generation Sequencing”.
The workflow is designed to evaluate and calculate true errors based on either Unique Molecular Identifier (UMI) clusters or mapping-duplicate clusters.
The workflow is written for the Snakemake workflow manager.
The error calculation package is written in the Julia programming language.
The workflow also uses:

- umi-tools to extract and process UMI sequence information.
- bwa-mem for aligning short paired-end reads to the reference.
- sambamba and samtools for sorting and filtering the aligned data, respectively.
DAG of the UMI-based error calculation workflow rules:
- Install conda:
  `wget -P miniconda https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && chmod 755 ./miniconda/Miniconda3-latest-Linux-x86_64.sh && ./miniconda/Miniconda3-latest-Linux-x86_64.sh`
- Add the path of miniconda to `.bashrc` if you did not select the option to add it automatically during installation:
  `cd ~/ && pth=$(pwd) && echo "PATH=$PATH:$pth/miniconda3/bin" >> ~/.bashrc`
- Add channels to conda:
  `conda config --add channels conda-forge`
  `conda config --add channels bioconda`
  `conda config --add channels anaconda`
- Install all by running:
  `bash envs/install_all.sh -n <environment_name>`
- To delete the environment run:
  `conda env remove -n <environment_name> -p <environment_path>`
- To run a custom analysis, first configure the `config.yaml` file.
- Activate the created environment:
  `source activate <environment_name>`
- Run snakemake:
  `snakemake -j 24 --configfile config.yaml --use-conda`
  You can provide rule names to the snakemake command as follows:
  `snakemake --configfile config.yaml --use-conda -j 24 trim download umi_fidelity`
- To run all tests:
  `bash tests/run_tests.sh --all --name test_env`
  A new conda environment will be created and removed after the tests finish.
- For further options check the help:
  `bash tests/run_tests.sh --help`
The workflow checks whether the must-have parameters were provided. Any other parameter, if not provided, is taken from the `default.yaml` file.
Thus your config file must contain at least these parameters:
    MAIN:
        INPUT_DIR:
        SAMPLES_ID:
        TMP_DIR:
        OUT_DIR:
        LANE_RE:
        REFERENCE:
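For orientation, a minimal `config.yaml` sketch with these must-have parameters filled in is shown below. The directory paths, sample names, and reference file are placeholder values only (the `INPUT_DIR`, `SAMPLES_ID`, and `LANE_RE` examples repeat the ones explained in the list that follows) and must be adapted to your data.

```yaml
# Minimal config.yaml sketch -- every value below is a placeholder.
MAIN:
    INPUT_DIR: /some/dir/to/seq/data            # directory with the raw sequencing files
    SAMPLES_ID: sample_1 sample_2               # file name prefixes without the Illumina suffix
    TMP_DIR: /some/dir/to/tmp                   # scratch directory used during the run
    OUT_DIR: /some/dir/to/results               # directory for the final outputs
    LANE_RE: "L\\d\\d\\d_"                      # regex matching the lane tag, e.g. L001
    REFERENCE: /some/dir/to/reference.fasta     # reference sequence to align against
```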
- Provide an existing path to your sequencing data:
  `MAIN: INPUT_DIR: /some/dir/to/seq/data`
- For the workflow to be able to find your sample sequencing files, provide the prefix / sample name of the sequencing files you want to analyse:
  `MAIN: SAMPLES_ID: sample_1 sample_2 # sample_1_S1_L001_R1_001.fastq.gz`
  Omit the Illumina sequencing file naming suffix. The workflow searches for *fastq.gz, *fq.gz or *.sfq files.
- Provide a regex so that files split by lanes can be found:
  `MAIN: LANE_RE: "L\\d\\d\\d_" # This would translate to lane number L001, for example.`
  Illumina sequencing might split data into separate lanes. The lane tag is usually 'L\d' or 'L\d\d\d', where '\d' is an integer.
  All files which match the sample id, lane, and read direction will be merged to generate a single sequencing file which is not split by lanes.
- The default trimming program is bbduk. All parameter names listed in the configuration file are exactly the same as if calling it from the CLI (see the configuration sketch after this list).
- To extract the UMI from raw read data you need to provide the UMI sequence structure:
  `UMI_TOOLS: umi_structure: NNNNNNNNNNNNNNNN`
  To extract a UMI from the second paired-end read as well, pass the --bc-pattern2 parameter for umi-tools:
  `UMI_TOOLS: additional_extract_params: " --bc-pattern2=NNNNNNNNNNNNNNNN "`
  You can also pass other umi-tools supported options here as a string separated by spaces. Both sections are illustrated in the sketch after this list.
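As an illustration of how the trimming and UMI extraction settings fit together in the configuration file, here is a rough sketch. The `BBDUK` section name and its values are assumptions made for this example (only the parameter names `ktrim`, `k`, `qtrim`, `trimq`, and `minlen` are standard bbduk CLI options); check `default.yaml` for the actual section name and defaults. The `UMI_TOOLS` keys repeat the ones shown above, configured for a 16-nt UMI on both paired-end reads.

```yaml
# Trimming settings -- section name and values are assumptions, see default.yaml.
BBDUK:
    ktrim: r            # trim adapter sequence from the 3' end
    k: 23               # k-mer length used for adapter matching
    qtrim: rl           # quality-trim both ends of the read
    trimq: 20           # quality threshold used for trimming
    minlen: 50          # drop reads shorter than this after trimming

# UMI extraction with umi-tools -- keys taken from the examples above.
UMI_TOOLS:
    umi_structure: NNNNNNNNNNNNNNNN                               # 16-nt UMI on read 1
    additional_extract_params: " --bc-pattern2=NNNNNNNNNNNNNNNN " # 16-nt UMI on read 2
```

With `--bc-pattern2` set, umi-tools extracts a UMI from both reads and appends the combined UMI to the read name, separated by an underscore, which is what the `umi_split_string` and `umi_split_position` options described below rely on.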
Available parameters for error calculations (the `FIDELITY` section of the config; a combined example is shown after this list):
- Filter reads by alignment quality:
  `alignment_quality: 60`
- Filter bases by base quality:
  `base_quality: 30 # min quality for mutation to be considered in read. [30]`
- Define the minimum cluster size (clusters smaller than this will be discarded from further analysis):
  `cluster_size: 3`
  If the naive algorithm is selected, this filter option is omitted.
  A cluster in this case is defined either by the UMI sequence plus the mapping position (umi algorithm) or by the mapping position alone (position algorithm).
- Set the allowed maximum number of errors detected in a read:
  `max_mutations_per_read: 3`
- If the same true error is found across many clusters, it might indicate that the reference sequence is not correct. To omit such errors provide an upper threshold:
  `referece_mutation: 0.95 # how many same mutations at position to consider as reference mistake.`
  If the same error was found in 95% of the UMI clusters, it is considered a reference assembly mistake.
- These two options should not be changed if umi-tools is used for UMI sequence extraction and grouping:
  `umi_split_string: "_"`
  `umi_split_position: 2 # position of the umi after splitting the read name by the split string. 1-based.`
- You can translate nucleotide mutations by providing these parameters:
  `genes_csv: Some_File.csv`
  `genetic_code: 11`
  In that case you need to pass a csv file with 'Chromosome,Start,End,Gene,Strand' columns. Coordinates are one-based and inclusive.
  With the `genetic_code` option you can tell which table to use for translation (BioSequences.ncbi_trans_table).
- You can also pass additional script parameters:
  `additional_params: "--algorithm umi"`
  Supported algorithms: naive, position, umi, paired. More can be checked with the CLI --help option.
- To report mutations provide:
  `report_mutations: True`
  All found mutations will be reported to a csv file.
- You can also apply additional filtering on mutation reporting:
  `min_cluster_counts: 1 # Minimum count of clusters for the mutation to be reported to a single_mutations file.`
  `min_mutation_coverage: 1 # Minimum confident coverage for the mutation to be reported to a single_mutations file.`
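Putting the options above together, a `FIDELITY` section for a UMI-based run might look like the sketch below. It assumes that all of these keys sit directly under `FIDELITY`, as the list above suggests; the values simply repeat the examples given above, and `Some_File.csv` stands for your own gene annotation file.

```yaml
# FIDELITY section sketch -- values repeat the examples from the list above.
FIDELITY:
    alignment_quality: 60              # drop reads below this mapping quality
    base_quality: 30                   # min base quality for a mutation to be counted
    cluster_size: 3                    # ignored when the naive algorithm is used
    max_mutations_per_read: 3          # discard reads with more detected errors than this
    referece_mutation: 0.95            # fraction of clusters sharing a mutation to call it a reference mistake
    umi_split_string: "_"              # defaults matching umi-tools read names
    umi_split_position: 2
    genes_csv: Some_File.csv           # columns: Chromosome,Start,End,Gene,Strand (1-based, inclusive)
    genetic_code: 11                   # NCBI translation table (BioSequences.ncbi_trans_table)
    additional_params: "--algorithm umi"
    report_mutations: True
    min_cluster_counts: 1              # min cluster count for a mutation to be reported
    min_mutation_coverage: 1           # min confident coverage for a mutation to be reported
```

A matching `genes_csv` file would contain one gene per row, for example `chr1,100,1500,geneA,+` (an invented illustration, not data from the publication).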