This pipeline is intended to emulate artMAP and it's been built by following the protocol described in the artMAP paper (please refer specifically to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6560221/bin/PLD3-3-e00146-s002.pdf).
Download this repository to your machine
git clone https://github.com/domenico-simone/mapping-by-sequencing.git
This pipeline relies on a bunch of dependencies, namely:
- snakemake
- FastQC
- trimmomatic
- bwa
- samtools
- bcftools
- bedtools
- snpEff
if you already have these tools installed on your system, please skip to the section Set up a run. If you don't, you can choose one of the following options:
- install them on your own
- install the conda environment provided in the mapping-by-sequencing repository (please follow instructions here)
- if you are working on the UPPMAX/Rackham cluster, all these tools can be loaded as modules (please follow instructions here)
cd mapping-by-sequencing
conda env create -n mbs -f envs/mbs.yaml
This will create a conda environment named mbs
which has to be activated every time you want to use the pipeline with one of this commands (depending on your conda installation):
conda activate mbs
# if the above fails, use this:
source activate mbs
When you're done with the pipeline, you may want to deactivate the conda environment with:
conda deactivate
These are needed to run the workflow down to SNP calling
module load bioinfo-tools
module load snakemake
module load FastQC
module load trimmomatic
module load bwa
module load samtools
module load bcftools
module load BEDTools
module load snpEff
# others
The reference genome in fasta format should be placed in the folder data/reference_genomes
and its file name has to be detailed in the config.yaml
file in the ref_genome
field.
The snpEff db matching the reference genome should be already installed in snpEff and its name should be detailed in the config.yaml
file in the snpEff_db
field.
Example: if your reference genome is included in a file My_ref_genome.fa
and your snpEff database is called Genus_species
, the config.yaml
file should look like:
results: "results"
map_dir: "map"
log_dir: "logs"
tmp_dir: "/tmp"
workdir: "test"
ref_genome: "My_ref_genome.fa"
snpEff_db: "Genus_species"
read_processing:
trimmomatic:
options: "-phred33"
processing_options: "LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:100"
java_cmd: "java"
java_vm_mem: "4G"
threads: 4
File data/datasets.tab
should be filled with data about your samples. Its structure is:
sample sample_type library R1 R2
D1K control 1 D1K_L1_1.fq.gz D1K_L1_2.fq.gz
D2K mutated 1 D2K_L2_1.fq.gz D2K_L2_2.fq.gz
D2K mutated 2 D2K_L3_1.fq.gz D2K_L3_2.fq.gz
D3K mutated 1 D3K_L2_1.fq.gz D3K_L2_2.fq.gz
D4K mutated 1 D4K_L1_1.fq.gz D4K_L1_1.fq.gz
D5K mutated 1 D5K_L2_1.fq.gz D5K_L2_2.fq.gz
where
- sample is the sample name
- sample_type indicates whether the sample is a control (the parental line) or a mutated line
- library takes into account the fact that a sample can have multiple read datasets (libraries). Eg, in the example above, sample D2K has two libraries.
- R1 and R2 are the name of the R1 and R2 file for each library, assuming they are located in the directory
data/reads
.
snakemake -rp -j 10
Please refer to the official snakemake tutorial for further examples on how to run a snakemake workflow.