## Merge pull request #47 from bmvdgeijn/development
Showing 194 changed files with 8,403 additions and 2,309 deletions.
Version 0.2 - September 3, 2016
-------------------------------

Version 0.2 of WASP is a major update to the code, especially the
mapping code. It fixes several bugs related to how paired-end reads are
handled. For this reason, it is strongly recommended that users switch
to this version of the pipeline.

Changes include:
* re-wrote mapping scripts to make them simpler and more modular
* re-wrote mapping test scripts and added many tests
* fixed several mapping pipeline bugs related to paired-end reads
* find_intersecting_snps.py window size is no longer required (it is
  now unlimited)
* find_intersecting_snps.py can now take HDF5 files as input
* find_intersecting_snps.py can now consider only the haplotypes
  present in samples, rather than all possible allelic combinations of
  SNPs overlapping reads
* added get_as_counts.py script that outputs allele-specific read
  counts at all polymorphic SNPs
* snp2h5 now records sample info in output HDF5 files
* improved speed of many CHT pipeline steps
* improved stability of CHT dispersion parameter estimation
* added Snakemake workflows for both mapping and CHT pipelines
* added qqplot.R script to CHT workflow

Version 0.1
-----------
Initial version of WASP
Changes to `.gitignore`:
    *.py[cod]

    # snakemake files
    .snakemake

    # emacs backups
    *~

    # C extensions
    *.so
## Snakemake CHT pipeline

[Snakemake](https://bitbucket.org/snakemake/snakemake/wiki/Home) is a
workflow management system designed to streamline the execution of
software pipelines. We now provide a Snakemake rule file that can be
used to run the entire Combined Haplotype Test (CHT) pipeline.

For a more complete description of Snakemake, see the
[Snakemake tutorial](http://snakemake.bitbucket.org/snakemake-tutorial.html).
## Installing Snakemake

Snakemake requires python3; however, the CHT pipeline requires python2.
For this reason, if you are using
[Anaconda](https://www.continuum.io/downloads), it is recommended that
you create a [python3
environment](http://conda.pydata.org/docs/py2or3.html#create-a-python-3-5-environment).
For example, you can create a python3.5 Anaconda environment with the
following shell command (this only needs to be done once):

    conda create -n py35 python=3.5 anaconda

You can then activate the py35 environment and install the latest
version of Snakemake with the following commands:

    source activate py35
    conda install snakemake

When you want to switch back to your default (e.g. python2)
environment, do the following:

    source deactivate
## Configuring the CHT pipeline

The rules for the Snakemake tasks are defined in the [Snakefile](Snakefile).

Configuration parameters for this Snakefile are read from the YAML file
[snake_conf.yaml](snake_conf.yaml).

Before running Snakemake, edit this file to specify the locations of
all of the input directories and files that will be used by the
pipeline. This includes the locations of the impute2 SNP files, input
BAM files, etc.

Importantly, you must set `wasp_dir` to point to the location of WASP
on your system, and set `py2` and `Rscript` to commands that set up the
environment for python and R (e.g. by modifying your PATH) and call the
appropriate interpreter. This is necessary because Snakemake is run
under python3, but most of the scripts require python2.
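To illustrate the parameters described above, here is a sketch of what
such a configuration might contain. Only `wasp_dir`, `py2`, and
`Rscript` are named in this document; the other keys and all paths
below are assumptions for illustration, so consult the snake_conf.yaml
shipped with WASP for the authoritative list:

```yaml
# Sketch of a snake_conf.yaml -- paths are placeholders, and the input
# location keys (impute2_dir, bam_dir) are assumed names, not
# necessarily those used by the shipped file.
wasp_dir: /home/user/WASP

# Commands that set up the environment and invoke each interpreter:
py2: "source activate py27 && python"
Rscript: "Rscript"

# Locations of input data (assumed key names):
impute2_dir: /data/impute2_snps
bam_dir: /data/bams
```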
## Running the CHT pipeline

Snakemake can be run as a single process or on a compute cluster with
multiple jobs running simultaneously. To run Snakemake on a single node
you could do something like the following:

    source activate py35
    cd $WASP_DIR/CHT
    snakemake

We provide a script [run_snakemake.sh](run_snakemake.sh) to run
Snakemake on an SGE compute cluster. You must be in a python3
environment to run this script, and the script must be run from a job
submission host:

    source activate py35
    cd $WASP_DIR/CHT
    ./run_snakemake.sh

It should be possible to make simple modifications to this script to
run on queue management systems other than SGE (e.g. LSF or Slurm).
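Adapting to another scheduler usually means changing the submission
command passed to Snakemake's generic `--cluster` option. A sketch (the
exact flags used by run_snakemake.sh may differ, and queue and resource
options are site-specific):

```shell
# SGE: submit each rule as a qsub job, running up to 20 at a time
snakemake --cluster "qsub -V -cwd" --jobs 20

# Slurm: the analogous sbatch invocation
snakemake --cluster "sbatch" --jobs 20
```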
You should run Snakemake from within a
[Screen](https://www.gnu.org/software/screen/) virtual terminal or
using [nohup](https://en.wikipedia.org/wiki/Nohup), so that Snakemake
will continue to run if you are disconnected from the cluster.
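For example, a minimal nohup invocation (the log filename here is an
arbitrary choice, not one the pipeline requires):

```shell
nohup snakemake > snakemake.log 2>&1 &
```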
At the conclusion of the pipeline, a QQ plot will be generated that
summarizes the results of the CHT.
## Debugging the CHT pipeline

By default, Snakemake will write an output file and an error file for
each job to your home directory. These files are named like
`snakejob.<rulename>.<job_num>.sh.{e|o}<sge_jobid>`. For example:

    # contains error output for the extract_haplotype_read_counts rule:
    snakejob.extract_haplotype_read_counts.13.sh.e4507125

If a rule fails, you should check the appropriate output file to see
what error occurred. A major benefit of Snakemake is that if you re-run
Snakemake after a job fails, it will pick up where it left off.
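The naming scheme above can be unpacked mechanically. As a small
illustration (this helper is hypothetical, not part of WASP), a regular
expression recovers the rule name, job number, stream, and SGE job id
from a log filename:

```python
import re

# Parse an SGE-style Snakemake log filename of the form
# snakejob.<rulename>.<job_num>.sh.{e|o}<sge_jobid>
# 'e' files hold stderr, 'o' files hold stdout.
LOG_RE = re.compile(
    r"snakejob\.(?P<rule>.+)\.(?P<job_num>\d+)\.sh\."
    r"(?P<stream>[eo])(?P<sge_jobid>\d+)$"
)

def parse_snakejob_log(filename):
    """Return a dict describing a snakejob log file, or None if no match."""
    m = LOG_RE.match(filename)
    if m is None:
        return None
    info = m.groupdict()
    info["stream"] = "stderr" if info["stream"] == "e" else "stdout"
    return info

print(parse_snakejob_log("snakejob.extract_haplotype_read_counts.13.sh.e4507125"))
# -> {'rule': 'extract_haplotype_read_counts', 'job_num': '13',
#     'stream': 'stderr', 'sge_jobid': '4507125'}
```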