This repo contains code and metadata for analyzing amplicon sequencing data.

- The amplicon dataset is described here.
- A workflow based on the dada2 pipeline is implemented in Snakemake to process the raw sequence data. The workflow implementation is described in detail here.
- Code for statistical analysis of the dataset following processing with dada2 is summarized here.
This pipeline runs on the ERDC Carpenter HPC cluster. It should be reproducible in other similar environments with relatively minimal modifications. Here are the basic steps you will need to follow.
You need to have a suitable version of snakemake installed. If you already have this from a previous project, there is no need to reinstall it. On some HPC systems, you may be able to skip installing it yourself and instead load an environment module that makes snakemake available to you.
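For example, on systems that provide Snakemake as an environment module, something like the following may work (the module name here is a guess; check what your system actually lists):

```bash
# See whether the system provides a snakemake module (names vary by system)
module avail snakemake

# Load it if available; "snakemake" is a placeholder module name
module load snakemake
```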
If you do need to install snakemake, conda or (probably better) mamba is recommended. There is a yaml file at envs/snakemake-7.25.0.yaml that specifies a suitable conda environment. Something like this should do the trick:

```bash
mamba env create -f envs/snakemake-7.25.0.yaml
```
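You can then check that the environment works; the environment name below is assumed to match the yaml file:

```bash
# Activate the new environment and sanity-check the install
conda activate snakemake-7.25.0
snakemake --version
```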
If you need to install conda/mamba, the Miniforge distribution suggested on the mamba installation page should work just fine.
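At the time of writing, the Miniforge README suggests an install command along these lines (check the Miniforge page for current instructions):

```bash
# Download and run the Miniforge installer for your platform
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash "Miniforge3-$(uname)-$(uname -m).sh"
```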
The code repository currently lives on the ERDC GitLab. Since you are reading this readme file, you have probably found it already. If you have not done so, use `git clone` to copy the repo to the cluster or other computer that you want to run it on.
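For example (the URL is a placeholder; substitute the actual address of this repo on the ERDC GitLab):

```bash
# Clone the repo; replace the URL with this repo's actual GitLab address
git clone https://gitlab.example.com/soil-micro/amplicon-pipeline.git
```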
The required raw sequence data files are described at data/readme.md. These files are too big to be stored in the code repository. (The sample metadata, however, are provided with the repo.) The sequence files need to be obtained separately and placed in the data folder following the instructions in data/readme.md. The checksums for the downloaded files should be compared against the values in data/md5sums.txt.
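One way to compare the checksums, assuming GNU coreutils and that the paths in md5sums.txt are relative to the data folder:

```bash
# Verify the downloaded sequence files against the recorded checksums
cd data && md5sum -c md5sums.txt
```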
For now, the sequence data files are available in the Soil Micro shared drive at datasets/.... Eventually, these files will likely be made available for public download. Download details will be provided here, along with shell commands to download the data at scripts/download.sh.
Reference databases for taxonomy assignment should be located in the databases folder. These databases are not included in the code repository and should be downloaded according to the instructions in databases/readme.md. The checksums for the downloaded files should be compared against the values in databases/md5sums.txt. Shell commands to download the data are included at scripts/download_databases.sh.
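For example, assuming the script can be run directly and that the checksum paths are relative to the databases folder:

```bash
# Download the reference databases, then verify them
bash scripts/download_databases.sh
cd databases && md5sum -c md5sums.txt
```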
Most of the required software can be downloaded and installed automatically by snakemake, but this is not the case for SEPP. Use the shell script at scripts/install_sepp.sh to install it:

```bash
bash scripts/install_sepp.sh
```
The Carpenter HPC cluster has no internet connectivity from the compute nodes, so snakemake cannot download software and create conda environments on the fly while running the pipeline. These environments will need to be created first on a login node:

```bash
conda activate snakemake-7.25.0
snakemake -c 1 --use-conda --conda-create-envs-only all
```
You can skip this step if your computing environment permits internet access during compute jobs, or if you are supplying the software in some other way.
You should now be in a position to run the pipeline:

```bash
conda activate snakemake-7.25.0
snakemake -c 192 --use-conda all
```
If you are running this on a cluster, you probably want to run it in a non-interactive compute job. The script at scripts/snakemake.pbs provides an example of how this could be done on a PBS system.
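A minimal sketch of what such a job script might look like is below; the resource request, queue name, and account string are placeholders, and scripts/snakemake.pbs remains the authoritative example. Submit it with qsub.

```bash
#!/bin/bash
#PBS -N amplicon-dada2
#PBS -l select=1:ncpus=192   # placeholder resource request
#PBS -l walltime=24:00:00    # placeholder walltime
#PBS -q standard             # placeholder queue name
#PBS -A MY_PROJECT           # placeholder allocation/account

# Run from the directory the job was submitted from
cd "$PBS_O_WORKDIR"

# Make conda usable in this non-interactive shell, then activate the environment
eval "$(conda shell.bash hook)"
conda activate snakemake-7.25.0

snakemake -c 192 --use-conda all
```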
You might also want to run the pipeline one part at a time in order to review results along the way. For example:

```bash
conda activate snakemake-7.25.0
snakemake -c 192 --use-conda all_demultiplex_qc
```
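Snakemake's dry-run flag is useful here for previewing which jobs a given target will trigger before committing compute time:

```bash
# Show what would be run for this target, without executing anything
snakemake -n --use-conda all_demultiplex_qc
```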
The code in this repo is intended for general use with amplicon sequencing projects and is organized to facilitate reuse, subject to the included license. In particular, our implementation of the dada2 pipeline should be usable with similar 16S/ITS datasets by simply forking the repo and updating the parameters in the configuration file config/config.yaml (e.g. names of input files, primer sequences, parameters used to control scripts, etc.). However, the code may require customization depending on the specifics of the dataset -- e.g. file naming, file formatting, different genetic markers sequenced, input files organized differently, other processing steps such as decontamination, etc. The modular organization of the Snakemake workflow should facilitate such customization. If customization is required, the snakefile and/or the R code files should be updated as necessary.
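As an illustration, the kinds of settings you would expect to adjust might look something like this (the keys below are hypothetical; see config/config.yaml for the actual names and structure):

```yaml
# Hypothetical excerpt -- key names are illustrative, not the real ones
input_dir: data                     # where the raw sequence files live
sample_metadata: data/metadata.tsv  # sample sheet shipped with the repo
primers:
  forward: GTGYCAGCMGCCGCGGTAA      # e.g. a 16S V4 forward primer (515F)
  reverse: GGACTACNVGGGTWTCTAAT     # e.g. a 16S V4 reverse primer (806R)
dada2:
  truncLen: [230, 200]              # read truncation lengths (fwd, rev)
  maxEE: [2, 2]                     # maximum expected errors (fwd, rev)
```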
Created by Stacey Doherty based on original content from Chris Baker