Snakemake pipeline for post-processing next-generation circulating methylome data generated by cfMeDIP-seq. This pipeline has been predominantly tested on cfMeDIP-seq data from the Ontario Institute for Cancer Research (OICR), but with modified configuration settings it should be adaptable to a wider selection of cfMeDIP-seq derived FASTQs. The pipeline code is developed and maintained by Eric Zhao (eyzhao).
- Install Anaconda. I recommend using the Miniconda installer, which provides a more minimal version of conda without many pre-installed packages. This is because the pipeline uses Snakemake's integration with Conda to automatically build its own conda environments with all the necessary dependencies, so it does not require any pre-installed packages.
- Install Snakemake using Mamba, per the instructions on the Snakemake website.
- Clone this GitHub repository.
- Clone https://github.com/pughlab/ConsensusCruncher into a separate location. If your sequencing protocol uses custom barcode sequences, you should have a text file with these barcode sequences, one per line, which can be input to ConsensusCruncher's `--blist` parameter.
- Clone https://github.com/eyzhao/MeDEStrand into a separate location. This is a fork of the original MeDEStrand, with one important change (ability to handle dynamically loading custom BSgenome packages) to make it compatible with the pipeline. Nothing else about it has been modified.
- Locate your reference genome file. In our config file, you will notice that we have constructed a custom genome by merging hg38 with two BACs from Arabidopsis: F19K16 from Arabidopsis Chr1 and F24B22 from Arabidopsis Chr3. This is because OICR modified the original protocol by ligating sequencing adapters to the Arabidopsis BACs so that these sequences are included in the final FASTQ. Doing so allows us to use these Arabidopsis spike-in sequences to normalize the human sequences (more on this later). If your sequencing approach does NOT include Arabidopsis sequences in the final FASTQ, then you can simply use a standard human genome build such as hg38.fa.
- Locate your BWA index.
- Point `config.yml` to the necessary assets as described in the section on configuration below.
- Run `bash update_conda.sh` in an environment where you have internet access. This automatically installs the conda environments that are in the `conda_env` directory.
- Install countreg package: this R package unfortunately is not yet on conda, so must be installed manually in a sort of hacky way. Locate your installed conda environments (by default in .snakemake/conda/[hash], as described in Snakemake documentation). Find out which `[hash].yml` file corresponds with the environment named `cfmedip_r`. Load that environment with `conda activate .snakemake/conda/[hash]`. Then run `R` to enter the R shell and `install.packages("countreg", repos="http://R-Forge.R-project.org")` to install `countreg`.
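The lookup in the countreg step can be scripted. The following is a minimal sketch, not part of the repository: `find_snakemake_env` is a hypothetical helper, and it assumes that each environment file under `.snakemake/conda` contains a `name: <env>` line and that the environment directory has the same path as that file minus its extension. Check both assumptions against your own `.snakemake/conda` directory before relying on it.

```shell
# Hypothetical helper (not shipped with the pipeline).
# Prints the Snakemake-managed conda environment directory whose
# environment file declares the given name, e.g. cfmedip_r.
# Assumption: the environment file contains a "name: <env>" line.
find_snakemake_env() {
    fse_name="$1"
    fse_dir="${2:-.snakemake/conda}"
    for fse_spec in "$fse_dir"/*.yml "$fse_dir"/*.yaml; do
        # Skip unmatched glob patterns and anything that is not a file.
        [ -f "$fse_spec" ] || continue
        if grep -Eq "^name:[[:space:]]*${fse_name}[[:space:]]*$" "$fse_spec"; then
            # Assumption: the environment directory is the file path
            # with its extension removed.
            printf '%s\n' "${fse_spec%.*}"
            return 0
        fi
    done
    echo "no environment named ${fse_name} found in ${fse_dir}" >&2
    return 1
}
```

With the helper defined, the manual steps could become `conda activate "$(find_snakemake_env cfmedip_r)"` followed by `Rscript -e 'install.packages("countreg", repos="http://R-Forge.R-project.org")'`, which performs the same install without opening an interactive R shell.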
The `config.yml` file contains all of the key configuration settings required to run the pipeline and customize it to your dataset. Ideally, you should not need to modify the `snakefile` itself, unless you have an unusual edge case.
Currently, `config.yml` is set up as an example. You will likely need to modify much of it to work with your own setup. The comment lines in `config.yml` provide detailed descriptions of what each configuration option means.
For running on a cluster, we recommend setting up a Snakemake profile specific to your cluster. For a guide on how to create a Snakemake profile for your cluster setup, see https://www.sichong.site/2020/02/25/snakemake-and-slurm-how-to-manage-workflow-with-resource-constraint-on-hpc/
To run the pipeline on SLURM, submit the launch.sh file with `sbatch launch.sh`.
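For orientation, a cluster run of a Snakemake workflow with a profile typically comes down to a single `snakemake` call. The sketch below is an illustration only and may differ from what the repository's `launch.sh` actually does: `run_pipeline` and the `PRINT_ONLY` variable are hypothetical names, and the profile name is whatever you chose when setting up your cluster profile. It uses only the standard Snakemake options `--use-conda` and `--profile`.

```shell
# Hypothetical wrapper; the repository's launch.sh may differ.
# Usage: run_pipeline <profile> [extra snakemake arguments...]
run_pipeline() {
    rp_profile="$1"
    shift
    # --use-conda lets Snakemake build and use the per-rule environments;
    # --profile selects the cluster profile by name or directory.
    set -- snakemake --use-conda --profile "$rp_profile" "$@"
    if [ "${PRINT_ONLY:-0}" = "1" ]; then
        # Show the command without executing it.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Setting `PRINT_ONLY=1` before calling `run_pipeline` prints the assembled command instead of running it, which is a cheap way to check the arguments before submitting a job.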
Below is a schematic showing the key Snakemake rules and how data flows through them.
The final output data of the pipeline are written in Feather and Parquet format, which are storage and I/O efficient formats that can be parsed in a wide variety of standard programming languages. For details, see Apache Arrow.