This repository is intended to be used as a supplement to a scholarly publication.
As such, the repository is made available “as-is”; maintenance and support are not available at this time.
This code was used to generate the data for the publication “MuA-based Molecular Indexing for Rare Mutation Detection by Next-Generation Sequencing”.
The workflow is designed to evaluate and calculate true errors based on either Unique Molecular Identifier (UMI) clusters or mapping-duplicate clusters.
The workflow is written for the Snakemake workflow manager.
The error calculation package is written in the Julia programming language.
The workflow also uses:

- umi-tools to extract and process UMI sequence information.
- bwa-mem for aligning short paired-end reads to the reference.
- sambamba and samtools for sorting and filtering the aligned data, respectively.
DAG of the UMI-based error calculation workflow rules:
- Install conda:
  `wget -P miniconda https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && chmod 755 ./miniconda/Miniconda3-latest-Linux-x86_64.sh && ./miniconda/Miniconda3-latest-Linux-x86_64.sh`
- Add the path of miniconda to `.bashrc` if you did not select the option to add it automatically during installation:
  `cd ~/ && pth=$(pwd) && echo "PATH=$PATH:$pth/miniconda3/bin" >> ~/.bashrc`
- Add channels to conda:
  `conda config --add channels conda-forge`
  `conda config --add channels bioconda`
  `conda config --add channels anaconda`
- Install all by running:
  `bash envs/install_all.sh -n <environment_name>`
- To delete the environment run:
  `conda env remove -n <environment_name> -p <environment_path>`
- To run a custom analysis, first configure the `config.yaml` file.
- Activate the created environment:
  `source activate <environment_name>`
- Run snakemake:
  `snakemake -j 24 --configfile config.yaml --use-conda`
  You can provide rule names to the snakemake command as follows:
  `snakemake --configfile config.yaml --use-conda -j 24 trim download umi_fidelity`
- To run all tests:
  `bash tests/run_tests.sh --all --name test_env`
  A new conda environment will be created and removed after the tests finish.
- For further options check the help:
  `bash tests/run_tests.sh --help`
The workflow checks whether the must-have parameters were provided. Any other parameter, if not provided, is taken from the `default.yaml` file.
Thus your config file must contain at least these parameters:
    MAIN:
        INPUT_DIR:
        SAMPLES_ID:
        TMP_DIR:
        OUT_DIR:
        LANE_RE:
        REFERENCE:
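For orientation, a minimal `config.yaml` sketch with these must-have parameters filled in is shown below. The directory paths, sample names, and reference file are placeholder values only (the `INPUT_DIR`, `SAMPLES_ID`, and `LANE_RE` examples repeat the ones explained in the list that follows) and must be adapted to your data.

```yaml
# Minimal config.yaml sketch -- every value below is a placeholder.
MAIN:
    INPUT_DIR: /some/dir/to/seq/data            # directory with the raw sequencing files
    SAMPLES_ID: sample_1 sample_2               # file name prefixes without the Illumina suffix
    TMP_DIR: /some/dir/to/tmp                   # scratch directory used during the run
    OUT_DIR: /some/dir/to/results               # directory for the final outputs
    LANE_RE: "L\\d\\d\\d_"                      # regex matching the lane tag, e.g. L001
    REFERENCE: /some/dir/to/reference.fasta     # reference sequence to align against
```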
- Provide an existing path to your sequencing data:
  `MAIN: INPUT_DIR: /some/dir/to/seq/data`
- For the workflow to be able to find your sample sequencing files, provide the prefix / sample name of the sequencing files you want to analyse:
  `MAIN: SAMPLES_ID: sample_1 sample_2 # sample_1_S1_L001_R1_001.fastq.gz`
  Omit the Illumina sequencing file naming suffix. The workflow searches for *fastq.gz, *fq.gz or *.sfq files.
- Provide a regex so that files split by lanes can be found:
  `MAIN: LANE_RE: "L\\d\\d\\d_" # This would translate to lane number L001, for example.`
  Illumina sequencing might split data into separate lanes. The lane tag is usually 'L\d' or 'L\d\d\d', where '\d' is an integer.
  All files which match the sample id, lane, and read direction will be merged to generate a single sequencing file which is not split by lanes.
- The default trimming program is bbduk. All parameter names listed in the configuration file are exactly the same as if calling it from the CLI (see the configuration sketch after this list).
- To extract the UMI from raw read data you need to provide the UMI sequence structure:
  `UMI_TOOLS: umi_structure: NNNNNNNNNNNNNNNN`
  To extract a UMI from the second paired-end read as well, pass the --bc-pattern2 parameter for umi-tools:
  `UMI_TOOLS: additional_extract_params: " --bc-pattern2=NNNNNNNNNNNNNNNN "`
  You can also pass other umi-tools supported options here as a string separated by spaces. Both sections are illustrated in the sketch after this list.
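As an illustration of how the trimming and UMI extraction settings fit together in the configuration file, here is a rough sketch. The `BBDUK` section name and its values are assumptions made for this example (only the parameter names `ktrim`, `k`, `qtrim`, `trimq`, and `minlen` are standard bbduk CLI options); check `default.yaml` for the actual section name and defaults. The `UMI_TOOLS` keys repeat the ones shown above, configured for a 16-nt UMI on both paired-end reads.

```yaml
# Trimming settings -- section name and values are assumptions, see default.yaml.
BBDUK:
    ktrim: r            # trim adapter sequence from the 3' end
    k: 23               # k-mer length used for adapter matching
    qtrim: rl           # quality-trim both ends of the read
    trimq: 20           # quality threshold used for trimming
    minlen: 50          # drop reads shorter than this after trimming

# UMI extraction with umi-tools -- keys taken from the examples above.
UMI_TOOLS:
    umi_structure: NNNNNNNNNNNNNNNN                               # 16-nt UMI on read 1
    additional_extract_params: " --bc-pattern2=NNNNNNNNNNNNNNNN " # 16-nt UMI on read 2
```

With `--bc-pattern2` set, umi-tools extracts a UMI from both reads and appends the combined UMI to the read name, separated by an underscore, which is what the `umi_split_string` and `umi_split_position` options described below rely on.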
Available parameters for error calculations (the `FIDELITY` section of the config; a combined example is shown after this list):
- Filter reads by alignment quality:
  `alignment_quality: 60`
- Filter bases by base quality:
  `base_quality: 30 # min quality for mutation to be considered in read. [30]`
- Define the minimum cluster size (clusters smaller than this will be discarded from further analysis):
  `cluster_size: 3`
  If the naive algorithm is selected, this filter option is omitted.
  A cluster in this case is defined either by the UMI sequence plus the mapping position (umi algorithm) or by the mapping position alone (position algorithm).
- Set the allowed maximum number of errors detected in a read:
  `max_mutations_per_read: 3`
- If the same true error is found across many clusters, it might indicate that the reference sequence is not correct. To omit such errors provide an upper threshold:
  `referece_mutation: 0.95 # how many same mutations at position to consider as reference mistake.`
  If the same error was found in 95% of the UMI clusters, it is considered a reference assembly mistake.
- These two options should not be changed if umi-tools is used for UMI sequence extraction and grouping:
  `umi_split_string: "_"`
  `umi_split_position: 2 # position of the umi after splitting the read name by the split string. 1-based.`
- You can translate nucleotide mutations by providing these parameters:
  `genes_csv: Some_File.csv`
  `genetic_code: 11`
  In that case you need to pass a csv file with 'Chromosome,Start,End,Gene,Strand' columns. Coordinates are one-based and inclusive.
  With the `genetic_code` option you can tell which table to use for translation (BioSequences.ncbi_trans_table).
- You can also pass additional script parameters:
  `additional_params: "--algorithm umi"`
  Supported algorithms: naive, position, umi, paired. More can be checked with the CLI --help option.
- To report mutations provide:
  `report_mutations: True`
  All found mutations will be reported to a csv file.
- You can also apply additional filtering on mutation reporting:
  `min_cluster_counts: 1 # Minimum count of clusters for the mutation to be reported to a single_mutations file.`
  `min_mutation_coverage: 1 # Minimum confident coverage for the mutation to be reported to a single_mutations file.`
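Putting the options above together, a `FIDELITY` section for a UMI-based run might look like the sketch below. It assumes that all of these keys sit directly under `FIDELITY`, as the list above suggests; the values simply repeat the examples given above, and `Some_File.csv` stands for your own gene annotation file.

```yaml
# FIDELITY section sketch -- values repeat the examples from the list above.
FIDELITY:
    alignment_quality: 60              # drop reads below this mapping quality
    base_quality: 30                   # min base quality for a mutation to be counted
    cluster_size: 3                    # ignored when the naive algorithm is used
    max_mutations_per_read: 3          # discard reads with more detected errors than this
    referece_mutation: 0.95            # fraction of clusters sharing a mutation to call it a reference mistake
    umi_split_string: "_"              # defaults matching umi-tools read names
    umi_split_position: 2
    genes_csv: Some_File.csv           # columns: Chromosome,Start,End,Gene,Strand (1-based, inclusive)
    genetic_code: 11                   # NCBI translation table (BioSequences.ncbi_trans_table)
    additional_params: "--algorithm umi"
    report_mutations: True
    min_cluster_counts: 1              # min cluster count for a mutation to be reported
    min_mutation_coverage: 1           # min confident coverage for a mutation to be reported
```

A matching `genes_csv` file would contain one gene per row, for example `chr1,100,1500,geneA,+` (an invented illustration, not data from the publication).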